Distributed and Sequential Algorithms for Bioinformatics
K. Erciyes
Distributed and Sequential Algorithms for Bioinformatics
Computational Biology
Volume 23
Editors-in-Chief
Andreas Dress
CAS-MPG Partner Institute for Computational Biology, Shanghai, China
Michal Linial
Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya
Princeton University, Princeton, NJ, USA
Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany
Editorial Board
Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden,
Germany
Pavel A. Pevzner, University of California, San Diego, CA, USA
Advisory Board
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joe Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken, Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Texas, Austin, TX, USA
Lonnie Welch, Ohio University, Athens, OH, USA
The Computational Biology series publishes the very latest, high-quality research
devoted to specific issues in computer-assisted analysis of biological data. The main
emphasis is on current scientific developments and innovative techniques in
computational biology (bioinformatics), bringing to light methods from mathemat-
ics, statistics and computer science that directly address biological problems
currently under investigation.
The series offers publications that present the state-of-the-art regarding the
problems in question; show computational biology/bioinformatics methods at work;
and finally discuss anticipated demands regarding developments in future
methodology. Titles range from focused monographs to undergraduate and
graduate textbooks, and professional text/reference works.
Distributed and Sequential Algorithms for Bioinformatics
K. Erciyes
Computer Engineering Department
Izmir University
Uckuyular, Izmir
Turkey
ISSN 1568-2684
Computational Biology
ISBN 978-3-319-24964-3 ISBN 978-3-319-24966-7 (eBook)
DOI 10.1007/978-3-319-24966-7
Preface

Technological advances over the last few decades have provided vast and
unprecedented amounts of biological data, including DNA, cell, and biological
network data. This data comes in two basic formats: DNA nucleotide and protein
amino acid sequences and, more recently, topological data of biological networks.
Analysis of this huge body of data is a task on its own, and the problems
encountered are NP-hard most of the time, defying solutions in polynomial time.
Such analysis is nevertheless required, as it provides a fundamental understanding
of the functioning of a cell, which can help us understand human health and disease
states and the diagnosis of diseases, and can further aid the development of
biotechnological processes for medical purposes such as the treatment of diseases.
Instead of searching for optimal solutions to these difficult problems, approximation
algorithms that provide sub-optimal solutions are usually preferred. An
approximation algorithm should guarantee a solution within an approximation
factor for all input combinations. In many cases, even approximation algorithms
are not known to date, and heuristics that are shown experimentally to work for
most inputs are accepted as solutions.
Under these circumstances, there is increasing demand and interest in the
research community for parallel/distributed algorithms that solve these problems
efficiently using a number of processing elements. This book is about both
sequential and distributed algorithms for the analysis and modeling of biological
data and, as such, is one of the first on this topic to the best of our knowledge.
In the context of this book, we take a distributed algorithm to mean a parallel
algorithm executed on a distributed-memory processing system using message
passing, rather than on special-purpose parallel architectures. For shared-memory
parallel computing, we use the term parallel algorithm explicitly. We also cover
algorithms for biological sequences (DNA and protein) and biological network
(protein interaction networks, gene regulation, etc.) data in the same volume.
Although algorithms for DNA sequences have a longer history of study, even
sequential algorithms for biological networks such as protein interaction networks
are rare and at an early stage of development. We aim to give a unified view of
algorithms for sequences and networks of biological systems where possible.
These two views are not totally unrelated; for example, the function of a protein is
influenced by both its position in a network
and its amino acid sequence, as well as by its 3D structure. The problems in the
sequence and network domains are also analogous to some extent; for example,
sequence alignment and network alignment, sequence motifs and network motifs,
and sequence clustering and network clustering are analogous problems in these
two related worlds. It is not difficult to predict that the analysis of biological
networks will have a profound effect on our understanding of the origins of life and
of health and disease states, just as the analysis of DNA/RNA and protein
sequences has.
Parallel and distributed algorithms are needed to solve bioinformatics problems
simply for the speedup they provide. Even linear-time algorithms may be difficult
to realize in bioinformatics due to the size of the data involved. For example, suffix
trees are fundamental data structures in bioinformatics, and constructing them
takes O(n) time using relatively new algorithms such as Ukkonen's or Farach's.
Considering that human DNA consists of over 3 billion base pairs, even these
linear-time algorithms are time-consuming. However, by using distributed suffix
trees, the time can be reduced to O(n/k), where k is the number of processors.
One wonders then about the scarcity of research efforts in the design and
implementation of distributed algorithms for these time-consuming, difficult tasks.
A possible reason is that a number of these problems have been introduced only
recently, and the general approach in the research community has been to search
for sequential algorithmic solutions first and then investigate ways of parallelizing
them or to design totally new parallel/distributed algorithms. Moreover, parallel
and distributed computing is a discipline of its own, and researchers in this field
may not be familiar with bioinformatics problems in general; multidisciplinary
efforts spanning this discipline and bioinformatics seem to be just starting. This
book is an effort to contribute to filling the aforementioned gap between distributed
computing and bioinformatics. Our main goal is first to introduce the fundamental
sequential algorithms to understand each problem, and then to describe distributed
algorithms that can be used for fundamental bioinformatics problems such as
sequence and network alignment, finding sequence and network motifs, and
clustering. We review the most fundamental sequential algorithms, which aid
understanding of the problem and yield parallel/distributed versions. In other
words, we try to be as comprehensive as possible in the coverage of
parallel/distributed algorithms for the fundamental bioinformatics problems, with
an in-depth analysis of sequential algorithms.
Writing a book on bioinformatics is a challenging task for a number of reasons.
First of all, there are many diverse topics to consider, from clustering to genome
rearrangement, from network motif search to phylogeny, and one has to be
selective so as not to diverge greatly from the initially set objectives. We had to
carefully choose a subset of topics in line with the book's title and aim: tasks that
require substantial computation power due to their data sizes and are therefore
good candidates for parallelization. Second, bioinformatics is a very dynamic area
of study with frequent technological advances and new results, which requires a
thorough survey of contemporary literature on the presented topics. The two
worlds of bioinformatics, namely biological sequences and biological networks,
have similar challenging problems. A closer look reveals these two worlds in fact have
Contents

1 Introduction  1
1.1 Introduction  1
1.2 Biological Sequences  2
1.3 Biological Networks  3
1.4 The Need for Distributed Algorithms  6
1.5 Outline of the Book  7
Reference  8

Part I Background

2 Introduction to Molecular Biology  11
2.1 Introduction  11
2.2 The Cell  12
2.2.1 DNA  13
2.2.2 RNA  14
2.2.3 Genes  15
2.2.4 Proteins  15
2.3 Central Dogma of Life  16
2.3.1 Transcription  17
2.3.2 The Genetic Code  18
2.3.3 Translation  18
2.3.4 Mutations  19
2.4 Biotechnological Methods  20
2.4.1 Cloning  20
2.4.2 Polymerase Chain Reaction  20
2.4.3 DNA Sequencing  21
2.5 Databases  22
2.5.1 Nucleotide Databases  22
2.5.2 Protein Sequence Databases  22
2.6 Human Genome Project  23
2.7 Chapter Notes  23
References  24

Index  363
1 Introduction

1.1 Introduction
Biology is the science of life and living organisms. An organism is a living entity that
may consist of organs, which are made of tissues. Cells are the building blocks of
organisms and form the tissues of organs. Cells consist of molecules, and molecular
biology is the science of studying the cell at the molecular level. The nucleus of a cell
contains deoxyribonucleic acid (DNA), which stores all of the genetic information.
DNA is a double-helix structure consisting of four types of molecules called
nucleotides, arranged in a long sequence of about 3 billion pairs. From the viewpoint
of computing, DNA is simply a string over a four-letter alphabet. Ribonucleic acid
(RNA) has one strand and also consists of four types of nucleotides like DNA, one
nucleotide being different. Proteins are large molecules outside the nucleus of the
cell that perform vital life functions. A protein is basically a linear chain of molecules
called amino acids. The molecules of the cell interact to perform all functions
necessary for living.
Recent technological advancements have provided vast amounts of biological
data at the molecular level. Analyzing and extracting meaningful information from
this data requires new methods and approaches. The data comes in two basic forms:
sequence and network data. On one hand, we are provided with sequence data of
DNA/RNA and proteins; on the other hand, we have topological information
about the connectivity of various networks within the cell. Analysis of this data is a
task on its own due to its huge size.

In this chapter, we first describe the problems encountered in the analysis of
biological sequences and networks, and we then describe why distributed algorithms
are imperative as computational methods. It seems the design and implementation of
distributed algorithms are inevitable for these time-consuming, difficult problems,
and their scarcity can be attributed to the relatively recent availability of the
biological data and the field being multidisciplinary in nature. We conclude by
providing the outline of the book.
1.2 Biological Sequences

The biological sequences in the cell we refer to are the nucleotide sequences in
DNA and RNA, and the amino acid sequences in proteins. DNA contains four
nucleotides in a double-helix-shaped two-strand structure: Adenine (A), Cytosine (C),
Guanine (G), and Thymine (T). Adenine always pairs with thymine, and guanine with
cytosine. The primary structure of a protein consists of a linear chain of amino acids,
and the order of these affects its 3D shape. Figure 1.1 shows a simplified diagram of
DNA and a protein.

Analysis of these biological sequences involves the following tasks:
Fig. 1.1 a DNA double helix structure. b A protein structure having a linear sequence of amino
acids
and locations of repeats are unique for individuals and can be used in forensics to
identify individuals.
• Gene finding: A gene codes for a polypeptide which can form or be a part of a
protein. There are over 20,000 genes in human DNA, occupying only about
3 % of the human genome. Finding genes is a fundamental step in their analysis. A
mutated gene may cause the formation of a wrong protein, disturbing the healthy
state of an organism, but mutations are harmless in many cases.
• Genome rearrangements: Mutations of DNA at a coarser level than point mutations
of nucleotides involve alterations of segments or genes in DNA. These changes
include reversals, duplications, and transfers to different locations in DNA. Genome
rearrangements may result in the production of new species, but in many cases they
are considered causes of complex diseases.
• Haplotype inference: DNA sequencing methods for humans provide the order of
DNA nucleotides from the two chromosomes of a pair together, as this approach is
more cost-effective and practical. However, the sequence information from a single
chromosome, called a haplotype, is needed for disease analysis and also to discover
ancestral relationships. Recovering single-chromosome data from the two
chromosomes is called haplotype inference and is needed for the data to be
meaningful.
All of the above-described tasks are fundamental areas of research in the analysis
of biological sequences. Comparison of sequences, clustering, and finding repeats
apply both to DNA/RNA and to protein sequences. Protein sequences may also be
employed to discover genes, as they are the product of genes; however, the genome
rearrangement and haplotype inference problems are commonly associated with
DNA/RNA sequences. Except for the sequence alignment problem, there are
hardly any polynomial algorithms for these problems. Even when there is a solution
in polynomial time, the size of the data necessitates the use of approximation
algorithms if they are available. As we will see, heuristic algorithms that can be
shown experimentally to work for a wide range of inputs are the only choice in
many cases.
1.3 Biological Networks

Biological networks consist of biological entities which interact in some form. The
modeling and analysis of biological networks are fundamental areas of research
in bioinformatics. The number of nodes in a biological network is large, and these
nodes have complex relations among them. We can represent a biological network
by a graph where an edge between two entities indicates an interaction between
them. This way, many results in graph theory and various graph algorithms
become available for immediate use to help solve a number of problems in biological
networks.
We can make a coarse distinction between the networks in the cell and other
biological networks. The cell contains DNA, RNA, proteins, and metabolites at the
molecular level. Networks at this level include gene regulation networks, signal
transduction networks, protein–protein interaction (PPI) networks, and metabolic
networks. DNA is static, containing the genetic code, while proteins take part in
various vital functions in the cell. Genes in DNA code for proteins in a process
called gene expression.
DNA, RNA, proteins, and metabolites all have their own networks. Intracellular
biological networks have molecules within the cell as their nodes and their
interactions as links. In PPI networks, proteins interact and cooperate with each
other to accomplish various vital functions for life. PPI networks are not constructed
randomly; they show specific structures. They have a few highly connected nodes
called hubs, while the rest of the nodes in the network have very few connections
to other proteins. The distance between the two farthest proteins in such a network
is very small compared to the size of the network. This type of network is called a
small-world network, as it is relatively easy to reach a node from any other node
in the network. Hubs in PPI networks form dense regions of the network, and these
clusters of nodes, called protein complexes, usually have some fundamental functions
attributed to them. It is therefore desirable to detect these clusters in PPI networks,
and we will investigate algorithms for this purpose in Chap. 12. A PPI network has
very few hubs and many low-degree nodes. Such networks are commonly called
scale-free networks, and a hub in a PPI network is presumed to act as a gateway
between a large number of nodes. Figure 1.2 shows a subnetwork of a PPI network
of the immune system where small-world and scale-free properties can be visually
detected along with a number of hubs.
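To make the degree-based notions above concrete, the following small Python
sketch (with a hypothetical interaction list; the protein names are invented)
computes node degrees from an edge list and reports the highly connected nodes:

from collections import Counter

# hypothetical PPI edge list; real networks have thousands of nodes
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
         ("P1", "P5"), ("P2", "P3"), ("P5", "P6")]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# a crude hub criterion for this toy example: degree at least 3
hubs = [p for p, d in degree.most_common() if d >= 3]
print(degree.most_common())  # [('P1', 4), ...]: P1 dominates, the hallmark of a hub
print("hubs:", hubs)         # ['P1']

In a real scale-free PPI network, the degree distribution follows a power law, so
such a tally would show many low-degree proteins and a long tail of a few hubs.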
The main areas of investigation in biological networks are the following:
Fig. 1.2 A subnetwork of the SpA PPI network involved in innate immune response (taken from
[1]). Triangles are the genes
there are only 13 possible motifs with 4 nodes, while the number of all subgraphs
of size 5 is 51. Comparison of networks and hence deducing ancestral relationships
are also possible by searching for common motifs in them.
• Network alignment: In many cases, two or more biological networks may be similar
and finding the similar subgraphs may show the conserved and therefore important
topological structures in them. A fundamental problem in biological networks is the
assessment of this similarity between two or more biological networks. Analogous
to sequence alignment of DNA/RNA and proteins, the network alignment process
aims to align two or more biological networks. This time, we basically search for
topological similarities between the networks. Node similarity is also considered
together with topological similarity to have a better view of the affinity of two
or more networks. Our aim in alignment is very similar to sequence alignment
in a different domain, and we try to compare two or more networks, find their
conserved modules, and infer ancestral relationships between them.
• Phylogeny: Phylogeny is the study and analysis of evolutionary relationships
among organisms. Phylogenetics aims to discover these relationships using
molecular sequence and morphological data to reconstruct evolutionary dependencies
among organisms. Phylogenetics is one of the most studied and researched
topics in bioinformatics, and its implications are numerous. It can help disease
analysis and the design of drug therapies by analyzing the phylogenetic dependencies
of pathogens. Transmission characteristics of diseases can also be predicted using
phylogenetics.
Most of the problems outlined above do not have solutions in polynomial time,
and we are left with approximation algorithms in rare cases but mostly with heuristic
algorithms. Since the size of the networks under consideration is huge, sampling
the network and running the algorithm on this sample is the approach typically
followed in many methods. It provides a solution in reasonable time at the expense
of decreased accuracy.
1.4 The Need for Distributed Algorithms

There are many challenges in bioinformatics, and we can first state that obtaining
meaningful data is difficult in general, as the data is noisy most of the time. The size
of the data is large, making it difficult to find time-efficient algorithms. The
complexity class P contains problems that can be solved in polynomial time, and
some problems we need to solve in bioinformatics belong to this class. Many
problems, however, can only be solved in exponential time, and in some cases
polynomial-time solutions to the problem at hand are not even known to exist.
These problems belong to the class of nondeterministic polynomial-time hard
(NP-hard) problems, and we need to rely on approximation algorithms, which find
suboptimal solutions with bounded approximation ratios for such hard problems.
Most of the time, there are no known approximation algorithms, and we need to
rely on heuristic methods that can be shown experimentally to work for most input
combinations.
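To state the guarantee precisely for a minimization problem: an algorithm has
approximation ratio α ≥ 1 if, for every input, it returns a solution of cost C satisfying

C ≤ α · C_OPT

where C_OPT is the cost of an optimal solution; for maximization problems the
bound is reversed accordingly. Heuristics, by contrast, provide no such bound.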
Parallel and distributed algorithms provide faster execution as a number of
processing elements are used in parallel. Here, we make a distinction between
shared-memory parallel processing, where processing elements communicate using a
shared and protected memory, and distributed processing, where computational nodes
of a communication network communicate by the transfer of messages without any
shared memory. Our main focus in this book is on distributed, that is,
distributed-memory, message-passing algorithms, although we will also cover a
number of shared-memory parallel algorithms. The efficiency of a parallel/distributed
algorithm is basically measured by the speedup it provides, which is expressed as the
ratio of the number of time steps of the sequential algorithm T(s) to the number of
time steps of the parallel algorithm T(p). If k processing elements are used, the
speedup should ideally approach k. However, the synchronization delays in
shared-memory parallel processing and the message delays of the interconnection
network in distributed processing mean that the speedup value can be far from ideal.
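As a small worked example with hypothetical timings: if a sequential algorithm
takes T(s) = 1200 time steps and its distributed version on k = 8 processors takes
T(p) = 200 steps, the speedup is

S = T(s)/T(p) = 1200/200 = 6 < k = 8

so the algorithm attains 6/8 = 75 % of the ideal speedup, the loss being due to the
communication and synchronization overheads just mentioned.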
Whether an algorithm finds an exact solution in polynomial time or an
approximate solution, again in polynomial time, bioinformatics problems have an
additional property: the size of the input data is huge. We may therefore attempt
to provide a parallel or distributed version of an algorithm even if it does solve the
problem in polynomial time. For example, suffix trees are versatile data structures that
find various uses in algorithmic solutions to bioinformatics problems. There exist
algorithms that construct suffix trees in a maximum of n steps, where n is the length
of the sequence from which the suffix tree is built. Although this time is linear, it is
still considerable for sequences of billions of bases.
Our description of algorithms in each chapter follows the same model: the
problem is first introduced informally and the related concepts are described. We
then present the problem formally with the notation used and review the
fundamental sequential algorithms, with emphasis on the ones that have the potential
to be modified and executed in parallel/distributed environments. After this step, we
usually provide a template of a generic algorithm that summarizes the general
approaches for obtaining a distributed algorithm for the problem at hand. This task
can be accomplished basically by partitioning data, partitioning code, or both among
the processing elements. We also propose new algorithms for specific problems
under consideration that can be conveniently implemented in this step. We then
review published work on parallel and distributed algorithms for the problem
described, and describe in more detail, at the algorithmic level, fundamental research
studies that have received attention. Finally, the main points of the chapter are
emphasized, with pointers for future work, in the Chapter Notes section.
1.5 Outline of the Book

The book has three parts. Part I provides basic background on bioinformatics, with
three chapters on molecular biology; graphs and algorithms; and parallel and
distributed processing. This part is not a comprehensive treatment of these three
main areas of research, but rather a dense coverage of the three major topics with
pointers for further study. It is self-contained, however, covering most of the
fundamental concepts needed for parallel/distributed processing in bioinformatics.
Part II describes the problems in the sequence analysis of biological data. We
start this part with string algorithms and describe algorithms for sequence alignment,
which is one of the most researched topics in the sequence world. We then investigate
sequence motif search algorithms and describe genome analysis problems such as
gene finding, genome rearrangement, and haplotype inference at a coarser level.
In Part III, our main focus is on parallel and distributed algorithms for biological
networks, with PPI networks at the center of focus. We will see that the main
problems in biological networks can be classified into cluster detection, network
motif discovery, network alignment, and phylogenetics. Cluster detection algorithms
aim to find clusters with dense inter-node connections, and these dense regions may
implicate a specific function or the health and disease states of an organism. Network
motif algorithms attempt to discover repeating patterns of subgraphs, and they may
indicate again some specific function attributed to these patterns. Network alignment
algorithms investigate the similarities between two or more graphs and show whether
the organisms have common ancestors and how closely they are related. The analysis
of phylogenetic trees and networks, which represent ancestor–descendant
relationships among existing individuals and organisms, is also reviewed in this part,
which concludes by discussing the main challenges and future prospects of
bioinformatics in the Epilogue chapter.
Reference
1. Zhao J, Chen J, Yang T-H, Holme P (2012) Insights into the pathogenesis of axial
spondyloarthropathy from network and pathway analysis. BMC Syst Biol. 6(Suppl 1):S4.
doi:10.1186/1752-0509-6-S1-S4
Part I Background

2 Introduction to Molecular Biology
2.1 Introduction
Modern biology has its roots in the work of Gregor Mendel, who identified the
fundamental rules of heredity in 1865. The discovery of chromosomes and genes
followed later, and in 1953 Watson and Crick disclosed the double-helix structure of
DNA. All living organisms have common characteristics such as replication,
nutrition, growth, and interaction with their environment. An organism is composed
of organs which perform specific functions. Organs are made of tissues, which are
composed of aggregations of cells that have similar functions. The cell is the basic
unit of life in all living organisms, and it contains molecules that have fundamental
functions for life. Molecular biology is the study of these molecules in the cell. Two
kinds of these molecules, proteins and nucleotides, have fundamental roles in
sustaining life. Proteins are the key components in everything related to life. DNA is
made of nucleotides, and parts of DNA called genes code for proteins, which perform
all the fundamental processes for living using biochemical reactions.
Cells synthesize new molecules and break large molecules into smaller ones using
complex networks of chemical reactions called pathways. The genome is the
complete set of DNA of an organism; the human genome consists of chromosomes
which contain many genes. A gene is the basic physical and functional unit of
heredity and codes for a protein, which is a large molecule made from a sequence of
amino acids. Three critical molecules of life are deoxyribonucleic acid (DNA),
ribonucleic acid (RNA), and proteins. A central paradigm in molecular biology states
that biological function is heavily dependent on biological structure.
In this chapter, we first review the functions performed by the cell and its
components. The DNA contained in the nucleus, the proteins, and various other
molecules all have important functions, and we describe these in detail. The central
dogma of life is the process of building a protein from the code in the genes, as we
will outline. We will also briefly describe biotechnological methods and introduce
some of the commonly used databases that store information about DNA, proteins,
and other molecules in the cell.
2.2 The Cell

Cells are the fundamental building blocks of all living things. The cell serves as a
structural building block to form tissues and organs. Each cell is independent and
can live on its own. All cells have a metabolism to take in nutrients and convert
them into molecules and energy to be used. Another important property of cells is
replication, in which a cell produces another cell that has the same properties as
itself. Cells are composed of approximately 70 % water; 7 % small molecules like
amino acids, nucleotides, salts, and lipids; and 23 % macromolecules such as proteins
and polysaccharides. A cell consists of molecules in a dense liquid surrounded by a
membrane, as shown in Fig. 2.1.
Eukaryotic cells have nuclei containing the genetic material, separated from the
rest of the cell by a membrane, while prokaryotic cells do not have nuclei.
Prokaryotes include bacteria and archaea; plants, animals, and fungi are examples of
eukaryotes. The tasks performed by cells include taking nutrients from food,
converting these to energy, and performing various special missions. A cell is
composed of many parts, each with a different purpose. The following are the
important parts of a eukaryotic cell, with their functions:
The nucleus is at the center of the cell and is responsible for vital functions such
as cell growth, maturation, division, and death. The cytoplasm consists of a jelly-like
fluid which surrounds the nucleus and contains various other structures. The
endoplasmic reticulum enwraps the nucleus; it processes molecules made by the cell
and transports them to their destinations. The conversion of energy from food to a
form that can be used by the cell is performed by the mitochondria, which have their
own genetic material. These components of the cell are shown in Fig. 2.1. The cell
contains various structures other than the ones outlined here.
Chemically, the cell is composed of only a few elements. Carbon (C), hydrogen (H),
oxygen (O), and nitrogen (N) are the dominant ones, with phosphorus (P) and sulfur
(S) appearing in smaller proportions. These elements combine to form molecules in
the cell using covalent bonds, in which electrons in the outer orbits are shared
between the atoms. A nucleotide is one such molecule in the cell; it is a chain
of three components: a base B, a sugar S, and a phosphoric acid P. The three basic
macromolecules in the cell that are essential for life are DNA, RNA, and proteins.
2.2.1 DNA
James Watson and Francis Crick discovered the deoxyribonucleic acid (DNA)
structure in the cell in 1953 using X-ray diffraction patterns, which showed that the
DNA molecule is long, thin, and has a spiral-like shape [5]. The DNA is contained in
the nuclei of eukaryotic cells and is composed of small molecules called nucleotides.
Each nucleotide consists of a five-carbon sugar, a phosphate group, and a base. The
carbon atoms in a sugar molecule are labeled 1′ to 5′; using this notation, a DNA
strand starts at the 5′ end and finishes at the 3′ end, as shown in Fig. 2.2. There are
four nucleotides in DNA, distinguished by the bases they have: Adenine (A),
Cytosine (C), Guanine (G), and Thymine (T). We can therefore think of DNA as a
string over the four-letter alphabet {A, C, G, T}. Human DNA consists of
approximately three billion bases. Since nucleotide A pairs only with T, and C pairs
only with G, we say that A and T are complementary, and so are G and C, as shown
in Fig. 2.2. Given the sequence S of a DNA strand, we can construct the other strand
S′ by taking the complement of the bases in this strand. If we take the complement of the
resulting strand, we will obtain the original strand. This process is used and is
essential for protein production. Physically, DNA consists of two strands held
together by hydrogen bonds, arranged in a double helix as shown in Fig. 2.3. The
complement of a DNA sequence consists of the complements of its bases. DNA
therefore consists of two complementary strands which bind to each other tightly,
providing a stable structure. This structure also provides the means to replicate, in
which the double helix is separated into two strands and each of these strands is then
used as a template to synthesize its complement.
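As a minimal illustration of complementation as a string operation (the
orientation of the strands is ignored here for simplicity), the following Python sketch
builds the complementary strand and verifies that complementing twice recovers the
original sequence:

# base-pairing rules: A-T and C-G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    return "".join(COMPLEMENT[base] for base in seq)

s = "GATTACA"
c = complement(s)
print(c)                   # CTAATGT
print(complement(c) == s)  # True: complementing twice is the identity

Note that the biologically paired strand runs in the opposite direction, so it is often
given as the reverse of this complement (the reverse complement).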
The DNA molecule is wrapped around proteins called histones into complex
structures called chromosomes in the nucleus of each cell. The number of
chromosomes depends on the eukaryotic species. Each chromosome consists of two
chromatids, coil-shaped structures connected near the middle, forming an x-like
structure. Chromosomes are kept in the nucleus of a cell in a highly packed and
hierarchically organized form. A single set of chromosomes in an organism is called
haploid, two sets of chromosomes diploid, and more than two sets polyploid.
Humans are diploid: one chromosome of each of the 23 chromosome pairs is
inherited from each parent. The sex chromosome pair is chromosome number 23,
which either has two X chromosomes, resulting in a female, or an X and a Y,
resulting in a male. The type of chromosome inherited from the father determines
the sex of the child in this case.
2.2.2 RNA
Ribonucleic acid (RNA) is an important molecule that is used to transfer genetic
information. It has a similar structure to DNA but consists of only one strand and
does not form a helix structure like DNA. It also has nucleotides, each consisting of a
sugar, a phosphate, and a base. The sugar, however, is a ribose instead of a
deoxyribose, hence the name RNA. Also, the DNA base thymine (T) is replaced with
uracil (U) in RNA. The fundamental kinds of RNA are messenger RNA (mRNA),
transfer RNA (tRNA), and ribosomal RNA (rRNA), which perform different
functions in the cell. RNA provides a flow of information in the cell. First, DNA is
copied to mRNA
in the nucleus, and the mRNA is then translated to protein in the cytoplasm. During
translation, tRNA and rRNA have important functions. The tRNA is responsible for
carrying the amino acids which make up the protein, as prescribed in the mRNA,
and the rRNA molecules are the fundamental building blocks of the ribosomes,
which carry out the translation of mRNA to protein.
2.2.3 Genes
A gene is the basic unit of heredity in a living organism, determining its character
as a whole. Physically, a gene is a sequence of DNA that codes for an RNA (mRNA,
tRNA, or rRNA), and the mRNA codes for a protein. The study of genes is called
genetics. Gregor Mendel, in the 1860s, was the first to experiment and set down the
principles of passing hereditary information to offspring.
There are 23 pairs of chromosomes in humans, and between 20,000 and 25,000
genes are located in these chromosomes. The starting and stopping locations of a
gene are identified by specific sequences. The protein-coding parts of a gene are
called exons, and the regions between exons with no specific function are called
introns. Genes have varying lengths, and the exons and introns within a gene also
have varying lengths. A gene can combine with other genes or can be nested within
another gene to yield some functionality, and it can be mutated, which may change
its functionality to varying degrees, in some cases leading to diseases. The complete
set of genes of an organism is called its genotype. Each gene has a specific function
in the physiology and morphology of an organism. The physical manifestation or
expression of the genotype is the phenotype, which is the physiology and
morphology of an organism. A gene may have different varieties called alleles,
resulting in different phenotypic characteristics. Humans are diploid, meaning we
inherit a chromosome from each parent; therefore, we have two alleles of each gene.
The genes that code for proteins constitute about 1.5 % of the total DNA, and the
rest contains RNA-encoding genes and sequences that are not known to have any
function; this part of DNA is called junk DNA. There is no direct relationship
between genome size, the number of genes, and organism complexity. In fact, some
single-cell organisms have a larger genome than humans.
2.2.4 Proteins
Proteins are large molecules of the cell, and they carry out many important
functions. For example, they form the antibodies which bind to foreign particles
such as viruses and bacteria. As enzymes, they work as catalysts for various
chemical reactions; messenger proteins transmit signals to coordinate biological
processes between different cells, tissues, and organs; and they also transport small
molecules within the cell and the body. Proteins are made from the information
contained in genes. A protein consists of a chain of amino acids connected by
peptide bonds. Since such a bond releases a water molecule, what we have inside a
protein is a chain of amino acid residues. Typically, a protein has about 300 amino
acid residues, which can reach 5,000 in large proteins. The essential 20 amino acids
that make up proteins are shown in Table 2.1 with their abbreviations, codes, and
polarities.
Proteins have highly complex structures which can be analyzed at four
hierarchical levels. The primary structure of a protein is specified by the sequence of
amino acids linked in a chain, and the secondary structure is formed by local regions
of the amino acid chain. A protein domain is a segment of the amino acid sequence
of a protein which has functions independent of the rest of the protein. A protein
also has a 3D structure, called the tertiary structure, which affects its functionality,
and several protein molecules arranged together form the quaternary structure. The
function of a protein is determined by this four-level structure. A protein has the
ability to fold in 3D, and the shape so formed affects its function. Using its 3D shape,
it can bind to certain molecules and interact. For example, mad cow disease is
believed to be caused by the wrong folding of a protein. For this reason, predicting
the folding structure of a protein from its primary sequence and finding the
relationship between its 3D structure and functionality has become one of the main
research areas in bioinformatics.
2.3 Central Dogma of Life

The central dogma of molecular biology, and hence of life, was formulated by
F. Crick in 1958, and it describes the flow of information between DNA, RNA, and
proteins. This flow can be specified as DNA → mRNA → protein, as shown in
Fig. 2.4. The forming of mRNA from a DNA strand is called transcription, and the
production of a protein based on the nucleotide sequence of the mRNA is called
translation, as described next.
2.3.1 Transcription
In the transcription phase of protein coding, a single-stranded RNA molecule called
mRNA is produced which is complementary to the DNA strand it is transcribed
from. The transcription process in eukaryotes takes place in the nucleus. The enzyme
called RNA polymerase starts transcription by first detecting and binding to a
promoter region of a gene. This special pattern of DNA, shown in Fig. 2.5, is used by
RNA polymerase to find where to begin transcription. The complementary copy of
the gene is then synthesized by this enzyme, and a terminating signal sequence in
DNA results in the ending of this process, after which pre-mRNA, which contains
exons and introns, is released. A post-processing step called splicing removes the
introns received from the gene and reconnects the exons to form the mature and
much shorter mRNA, which is transferred to the cytoplasm for the second phase,
called translation. The complete gene contained in the chromosome is called
genomic DNA, and the sequence with exons only is called complementary DNA or
cDNA [25].
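As a rough sketch, transcription can be viewed as a string operation on the
template strand: each base is complemented, with uracil written where thymine's
partner would be. The Python fragment below makes this concrete (splicing and the
5′/3′ orientation are ignored in this simplification):

# template DNA base -> mRNA base (complement, with U in place of T's partner)
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template):
    return "".join(DNA_TO_MRNA[base] for base in template)

print(transcribe("TACAATAACATT"))  # AUGUUAUUGUAA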
2.3.2 The Genetic Code

The genetic code provides the mapping between the sequence of nucleotides and the
sequence of amino acids in proteins. The code is read in triplets of nucleotide bases
called codons, where each codon encodes one amino acid. Since there are four
nucleotide bases, the total number of possible codons is 4³ = 64. However, proteins
are made of only 20 amino acids, which means many amino acids are specified by
more than one codon. Table 2.2 displays the genetic code.
Such redundancy provides fault tolerance in case of mutations in the nucleotide
sequences of DNA or mRNA. For example, a mutation may change the codon UUA
into UUG in mRNA, but the amino acid leucine corresponding to each of these
sequences is formed in both cases. Similarly, all three of the codons UAA, UAG, and
UGA cause termination of the polypeptide sequence, so a single mutation from A to
G or from G to A still causes termination, preventing unwanted growth due to
mutations. Watson et al. showed that the sequence order of codons in DNA
corresponds directly to the sequence order of amino acids in proteins [28]. The
codon AUG specifies the beginning of a protein amino acid sequence; therefore, the
amino acid methionine is found as the first amino acid in all proteins.
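The codon table lends itself directly to a lookup-table implementation. The sketch
below uses only the handful of codons mentioned above (the full table has 64
entries) and translates the mRNA produced by the transcription sketch in the
previous section:

# a fragment of the genetic code; the full table maps all 64 codons
CODON_TABLE = {
    "AUG": "Met", "UUA": "Leu", "UUG": "Leu",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def translate(mrna):
    protein = []
    # read the mRNA in triplets; this sketch assumes it begins at the AUG start codon
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

print(translate("AUGUUAUUGUAA"))  # ['Met', 'Leu', 'Leu']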
2.3.3 Translation
The translation phase is the process where a mature mRNA is used as a template
to form proteins. It is carried out by large molecules called ribosomes, which consist
of proteins and ribosomal RNA (rRNA) [5]. A ribosome uses tRNA to first detect the
start codon in the mRNA, which is the nucleotide base sequence AUG. The tRNA
has three bases called the anticodon, which are complementary to the codon it reads.
The amino acids prescribed by the mRNA are then added one by one to the linear
protein structure according to the genetic code. Translation of the protein is
concluded by detecting one of the three stop codons. Once a protein is formed, it
may be transferred to the needed location by signals in its amino acid sequence. The
new protein must fold into a 3D structure before it can function [27]. Figure 2.6
displays the transcription and translation phases of a hypothetical protein made of
six amino acids, as prescribed by the mRNA.
2.3.4 Mutations
2.4 Biotechnological Methods

2.4.1 Cloning

2.4.2 Polymerase Chain Reaction
The polymerase chain reaction (PCR), developed by Kary Mullis [3] in 1983, is a
biomedical technology used to amplify a selected DNA segment over several orders
of magnitude. The amplification of DNA is needed for a number of applications,
including the analysis of genes, discovery of DNA motifs, and diagnosis of
hereditary diseases. PCR uses thermal cycling, in which two phases are employed. In
the first phase, the DNA is separated into two strands by heat; then, each single
strand is extended to a double strand by the inclusion of a primer and polymerase
processing. DNA polymerase is a type of enzyme that synthesizes new strands of
DNA complementary to the target sequence. These two steps are repeated many
times, resulting in an exponential growth of the initial DNA segment. There are
some limitations to PCR processing, such as the accumulation of pyrophosphate
molecules and the existence of inhibitors of the polymerase in the DNA sample,
which result in the stopping of the amplification.
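Assuming perfect doubling in every thermal cycle (an idealization; real efficiency
is lower and eventually plateaus for the reasons above), n cycles amplify N0 template
molecules to

N = N0 · 2^n

so a single template subjected to n = 30 cycles ideally yields 2^30 ≈ 10^9 copies,
which is the exponential growth referred to above.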
2.4.3 DNA Sequencing

The sequence order of bases in DNA is needed to find the genetic information. DNA
sequencing is the process of obtaining the order of nucleotides in DNA. The obtained
sequence data can then be used to analyze DNA for various tasks, such as finding
evolutionary relationships between organisms and the treatment of diseases. The
exons are the parts of DNA that contain genes to code for proteins, and the set of all
exons in a genome is called the exome. Sequencing the exons is known as
whole-exome sequencing. However, research reveals that DNA sequences external to
the exons also affect protein coding and the health state of an individual. In
whole-genome sequencing, the whole genome of an individual is sequenced.
New-generation technologies have been developed for both of these processes. A
number of methods exist for DNA sequencing, and we will briefly describe only a
few fundamental ones.
The sequencing technology called Sanger sequencing, named after Frederick
Sanger who developed it [23,24], uses deoxynucleotide triphosphates (dNTPs) and
dideoxynucleotide triphosphates (ddNTPs), which are essentially nucleotides with
minor modifications. The DNA strand is copied using these altered bases, and when
one of them is incorporated into a sequence, it stops the copying process, which
results in short DNA segments of different lengths. These segments are ordered by
size, and the nucleotides are read from the shortest to the longest segment. The
Sanger method is slow, and newer technologies have been developed. The shotgun
method of sequencing was used to sequence larger DNA segments. The DNA
segment is broken into many overlapping short segments, and these segments are
then cloned. The short segments are selected at random and sequenced in the next
step. The final step of this method involves assembling the short segments in the
most likely order to determine the sequence of the long segment, using the overlap
data of the short segments.
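The assembly step can be illustrated by a toy greedy procedure that repeatedly
merges the pair of reads with the largest suffix/prefix overlap; real assemblers are
vastly more sophisticated, and the reads below are invented for illustration:

def overlap(a, b):
    # length of the longest suffix of a that is a prefix of b
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        # pick the ordered pair of reads with the largest overlap
        k, a, b = max((overlap(a, b), a, b)
                      for a in reads for b in reads if a is not b)
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[k:])  # merge the pair, keeping the overlap once
    return reads[0]

print(assemble(["TTACA", "GATTA", "ACAGG"]))  # GATTACAGG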
Next-generation DNA sequencing methods employ massively parallel processing
to overcome the problems of the previous sequencing methods. Three platforms are
widely used for this purpose: the Roche/454 FLX [21], the Illumina/Solexa Genome
Analyzer [4], and the Ion Torrent Proton/PGM sequencer [12]. The Roche/454
FLX uses the pyrosequencing method, in which the input DNA strand is divided
into shorter segments which are amplified by the PCR method. Afterward, multiple
reads are sequenced in parallel by detecting optical signals as bases are added.
Illumina sequencing uses a similar method: the input sample fragment is cleaved
into short segments, and each short segment is amplified by PCR. The fragments are
located on a slide which is flooded with color-labeled nucleotides and DNA
polymerase. By taking images of the slide, adding bases, and repeating this process,
the bases at each site can be detected to construct the sequence. Ion Proton
sequencing makes use of the fact that the addition of a dNTP to a DNA polymer
releases an H+ ion. The preparation of the slide is similar to the other two methods,
and the slide is flooded with dNTPs. Since each H+ ion decreases the pH, the
changes in pH level are used to detect nucleotides [8].
2.5 Databases
2.5.1 Nucleotide Databases

The major databases for nucleotides are GenBank [10], the European Molecular
Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) database [19],
and the DNA Data Bank of Japan (DDBJ) [26]. GenBank is maintained by the
National Center for Biotechnology Information (NCBI), U.S., and contains
sequences for various organisms including primates, plants, mammals, and bacteria.
It is a fundamental nucleic acid database, and genomic data is submitted to GenBank
from research projects and laboratories. Searches in this database can be performed
by keywords or by sequences. The EMBL-EBI database is based on the EMBL
Nucleotide Sequence Data Library, which was the first nucleotide database in the
world and receives contributions from projects and authors. EMBL supports the
text-based retrieval tool SRS, and BLAST and FASTA for sequence-based
retrieval [9].
2.5.2 Protein Sequence Databases

Protein sequence databases store protein amino acid sequence information. Two
commonly used protein databases are the Protein Identification Resource (PIR)
[16,31] and UniProt [15], which contains SwissProt [2]. The PIR contains protein
amino acid sequences and structures of proteins to support genomic and proteomic
research. It was founded by the National Biomedical Research Foundation (NBRF)
for the identification and interpretation of protein sequence information; the Munich
Information Center for Protein Sequences (MIPS) [22] in Germany and the Japan
International Protein Information Database later joined this database. The SwissProt
protein sequence database was established in 1986 and provides protein functions,
their hierarchical structures, and diseases related to proteins. The Universal Protein
Resource (UniProt) was formed in 2003 by the collaboration of EMBL-EBI, the
Swiss Institute of Bioinformatics (SIB), and PIR, and SwissProt was incorporated
into UniProt. PDBj (Protein Data Bank Japan) is a protein database in Japan
providing an archive of macromolecular structures and integrated tools [17].
Exercises
1. For the DNA base sequence S = AACGTAGGCTAAT, work out the
complementary sequence S′ and then the complement of the sequence S′.
2. A hypothetical gene has the sequence CCGTATCAATTGGCATC. Assuming this
gene has exons only, work out the amino acid sequence of the protein to be formed.
3. Discuss the functions of the three RNA molecules tRNA, rRNA, and mRNA.
4. A protein consists of the amino acid sequence A-B-N-V. Find three gene
sequences that could have resulted in this protein.
5. Why is DNA amplification needed? Compare the cloning and PCR methods of
amplifying DNA in terms of the technology used and their performance.
References
1. Alberts B, Bray D, Lewis J, Raff M, Roberts K (1994) Molecular biology of the cell. Garland
Publishing, New York
2. Bairoch A, Apweiler R (1999) The SWISS-PROT protein sequence data bank and its supple-
ment TrEMBL in 1999. Nucleic Acids Res 27(1):49–54
3. Bartlett JMS, Stirling D (2003) A short history of the polymerase chain reaction. PCR Protoc
226:3–6
4. Bentley DR (2006) Whole-genome resequencing. Curr Opin Genet Dev 16:545–552
5. Brandenberg O, Dhlamini Z, Sensi A, Ghosh K, Sonnino A (2011) Introduction to molecular
biology and genetic engineering. Biosafety Resource Book, Food and Agriculture Organization
of the United Nations
6. Caspi R, Altman T, Dreher K, Fulcher CA, Subhraveti P, Keseler IM, Kothari A, Krummenacker
M, Latendresse M, Mueller LA, Ong Q, Paley S, Pujar A, Shearer AG, Travers M, Weerasinghe
D, Zhang P, Karp PD (2011) The MetaCyc database of metabolic pathways and enzymes
and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 40(Database
issue):D742–D753
7. CBD (Convention on Biological Diversity). 5 June 1992. Rio de Janeiro. United Nations
8. http://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course
9. http://www.ebi.ac.uk/embl/
10. http://www.ncbi.nlm.nih.gov/genbank
11. http://www.ncbi.nlm.nih.gov/geo/
12. http://www.iontorrent.com/. Ion Torrent official page
13. http://www.genome.jp/kegg/pathway.html
14. http://www.ebi.ac.uk/arrayexpress/. Website of ArrayExpress
15. http://www.uniprot.org/. Official website of UniProt
16. http://pir.georgetown.edu/. Official website of PIR at Georgetown University
17. http://www.pdbj.org/. Official website of Protein Databank Japan
18. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids
Res 28(1):27–30
19. Kneale G, Kennard O (1984) The EMBL nucleotide sequence data library. Biochem Soc Trans
12(6):1011–1014
20. Lewin B (1994) Genes V. Oxford University Press, Oxford
21. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS et al (2005) Genome sequencing in
microfabricated high-density picolitre reactors. Nature 437:376–380
22. Mewes HW, Andreas R, Fabian T, Thomas R, Mathias W, Dmitrij F, Karsten S, Manuel S,
Mayer KFX, Stümpflen V, Antonov A (2011) MIPS: curated databases and comprehensive
secondary data resources in 2010. Nucleic Acids Res (England) 39
23. Sanger F, Coulson AR (1975) A rapid method for determining sequences in DNA by primed
synthesis with DNA polymerase. J Mol Biol 94(3):441–448
24. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors.
Proc Natl Acad Sci USA 74(12):5463–5467
25. Setubal JC, Meidanis J (1997) Introduction to computational molecular biology. PWS Publish-
ing Company, Boston
26. Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, Sugawara H et al (2002)
DNA data bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res
30(1):27–30
27. Voet D, Voet JG (2004) Biochemistry, 3rd edn. Wiley
28. Watson J, Baker T, Bell S, Gann A, Levine M, Losick R (2008) Molecular biology of the gene.
Addison-Wesley Longman, Amsterdam
29. Watson JD (1986) Molecular biology of the gene, vol 1. Benjamin/Cummings, Redwood City
30. Watson JD, Hopkins NH, Roberts JW, Steitz JA, Weiner AM (1987) Molecular biology of the
gene, vol 2. Benjamin/Cummings, Redwood City
31. Wu C, Nebert DW (2004) Update on genome completion and annotations: protein information
resource. Hum Genomics 1(3):229–233
3 Graphs, Algorithms, and Complexity
3.1 Introduction
Graphs are discrete structures that are frequently used to model biological
networks. We start this chapter with a brief review of basic graph theory concepts
and then describe notions such as connectivity and distance that are used in the
analysis of biological networks. Trees are graphs with no cycles and are used for
various representations of biological networks. Spectral properties of graphs provide
the means to explore structures such as clusters in biological networks and are
briefly described.

Algorithms are key to solving bioinformatics problems, and we provide a short
review of fundamental algorithmic methods, starting with the complexity of
algorithms. Our emphasis in this section is again on methods, such as dynamic
programming and graph algorithms, that find many applications in bioinformatics.
Basic graph traversal procedures such as breadth-first search and depth-first search
are frequently employed for various problems in biological networks and are
described in detail. We then introduce the complexity classes NP, NP-hard, and
NP-complete for problems that cannot be solved in polynomial time. Approximation
algorithms with known approximation ratios provide suboptimal solutions to
difficult problems, and sometimes our only choice in tackling a bioinformatics
problem is the use of heuristics, which are commonsense rules that work for most
cases of the input. We present a rather dense review of all of these key concepts in
relation to bioinformatics problems in this chapter.
3.2 Graphs
Graphs are frequently used to model networks of any kind. A graph G(V, E) has a
nonempty vertex set V and a possibly empty edge set E. The number of vertices
of a graph is called its order and the number of its edges is called its size. An edge
e of a graph is incident to two vertices called its endpoints. A loop is an edge that
starts and ends at the same vertex. Multiple edges are edges incident to the same
pair of endpoint vertices. A simple graph does not have any loops or multiple edges;
in the context of this book, we consider only simple graphs. The complement of a
graph G(V, E) is the graph G′(V, E′) which has the same vertex set as G and an edge
(u, v) ∈ E′ if and only if (u, v) ∉ E. A clique is a graph with the maximum number
of connections between its vertices: each vertex in a clique has n − 1 connections
to all other vertices, where n is the order of the clique. A maximal clique is a clique
which is not a proper subset of another clique. A maximum clique of a graph G is a
clique of maximum size in G. Figure 3.1 displays example graphs.
The edges of a directed graph have orientation between their endpoints. An ori-
ented edge (u, v) of a directed graph shown by an arrow from u to v, starts from
vertex u and ends at vertex v. The degree of a vertex v of a graph G is the number
of edges that are incident to v. The maximum degree of a graph is shown by Δ(G).
The in-degree of a vertex v in a directed graph is the number of edges that end at v
and the out-degree of a vertex v in such a graph is the number of edges that start at
v. A directed graph is depicted in Fig. 3.2a. The in-degrees of vertices a, b, c, d, e in
(a) (b) f
a c b
a
b d
e e
d
g
(a) (b) 4 5
1 2 6
e2
1 2
e1 e3 2 1 3 4
6 e7 e9 3
3 2 4
e8
e6 e4
5 4 4 1 2 3 5
e5
5 1 4 6
6 1 5
neighbor requires n steps. For sparse graphs where the density of the graph is low
with relatively much less number of edges than dense graphs (m n), this structure
has the disadvantage of occupying unnecessary memory space. The adjacency list
takes m + n memory locations and examining all neighbors of a given node requires
m/n time on average. However, checking the existence of an edge requires m/n
steps whereas we can achieve this in one step using an adjacency matrix. In general,
adjacency lists should be preferred for sparse graphs and adjacency matrices for
dense graphs.
(a) (b) 1
a b
9 2
a b c d
8
f g
10 c
11 12 7
6
h g f e i h
5 3
e d
4
Fig. 3.4 a The bridge (g, f ) of a graph. The endpoints of this bridge are also cut vertices of the
graph. b The Hamiltonian cycle is {a, b, c, d, e, i, h, g, f, a} and the sequence of Eularian trail
starting from vertex a is shown by the numbers on edges
(a)
(b) 3
1 2
8 9
7 4 1 3
6
2
6
5 4
5
3.2.4 Trees
A tr ee is a graph with no cycles and a f or est is a graph that has two or more
disconnected trees. A spanning tree T (V, E ) of a graph G(V, E) is a tree that covers
all vertices of G where |E | ≤ |E|. A tree where a special vertex is designated as
root with all vertices having an orientation toward it is called a rooted tree, otherwise
the tree is unr ooted. Any vertex u except the root in a rooted tree is connected to
a vertex v on its path to the root called its par ent, and the vertex u is called the
child of vertex v. For an edge-weighted graph G(V, E, w), the minimum spanning
tree (MST) of G has the minimum total sum of weights among all possible spanning
trees of G. The MST of a graph is unique if all of its edges have distinct weights.
Figure 3.5 shows a forest consisting of unrooted trees and a rooted MST of a graph.
Given a square matrix A[n ×n], an eigenvalue λ and the corresponding eigenvector
x of A satisfy the following equation:
Ax = λx (3.1)
which can be written as
Ax − λx = 0 (3.2)
(A − λI )x = 0 (3.3)
The necessary condition for Eq. 3.2 to hold is det (A − λI ) to be 0. In order to
find an eigenvalue of a matrix A, we can do the following [3]:
1. Form matrix (A − λI ).
2. Solve the equation for λ values.
3. Substitute each eigenvalue in Eq. 3.1 to find x vectors corresponding to λ values.
Eigenvalues are useful in solving many problems in engineering and basic sci-
ences. Our main interest in eigenvalues and eigenvectors will be their efficient use
for clustering in biological networks. The Laplacian matrix of a graph is obtained by
subtracting its adjacency matrix A from its degree matrix D which has dii = deg(i)
at diagonal elements and di j = 0 otherwise as L = D − A. An element of the
Laplacian matrix therefore is given by
⎧
⎨ 1 if i = j and deg( j) = 0
L i j = −1 if i and j are adjacent (3.4)
⎩
0 otherwise
The Laplacian matrix L provides useful information about the connectivity of
a graph [4]. It is positive semidefinite with all eigenvalues except the smallest one
being positive and the smallest eigenvalue is 0. The number of components of a graph
3.2 Graphs 33
G is given by the number of eigenvalues of its Laplacian which are 0. The second
smallest eigenvalue of L(G) shows whether graph G is connected or not, a positive
value showing connectedness. A greater second eigenvalue shows a better connected
graph.
3.3 Algorithms
The time complexity of an algorithm is the number of steps needed to obtain the
output and is commonly expressed in terms of the size of the input. In Algorithm
3.1, we would need at least n steps since the f or loop is executed n times. Also, the
assignment outside the loop needs 1 step. However, we would be more interested in
the part of the algorithm that has a running time dependent on the input size, as the
execution times of the other parts of an algorithm would diminish when the input
size gets larger. In the above example, the algorithm has to be executed exactly n
times. However, if we change the requirement to find the first occurrence of the input
key, then the execution time would be at most n steps.
34 3 Graphs, Algorithms, and Complexity
3.3.2 Recurrences
A recursive algorithm calls itself with smaller values until a base case is encountered.
A check for the base case is typically made at the start of the algorithm and if this
condition is not met, the algorithm calls itself with possibly a smaller value of the
input. Let us consider finding the nth power of an integer x using a recursive algorithm
called Power as shown in Algorithm 3.2. The base case which is the start of the
returning point from the recursive calls is when n equals 0, and each returned value
is multiplied by the value of x when called which results in the value x n .
Recursive algorithms result in simpler code in general but their analysis may not
be simple. They are typically analyzed using recurrence relations where time to
solve the algorithm for an input size is expressed in terms of the time for smaller
input sizes. The following recurrence equation for the Power algorithm defines the
relation between the time taken for the algorithm for various input sizes:
3.3 Algorithms 35
T (n) = T (n − 1) + 1 (3.5)
= T (n − 1) + 1 = T (n − 2) + 2
= T (n − 2) + 2 = T (n − 3) + 3
Proceeding in this manner, we observe that T (n) = T (n −k)+k and when n = k,
T (n) = T (0) + n = n steps are required, assuming that the base case does not take
any time. We could have noticed that there are n recursive calls until the base case
of n = 0 and each return results in one multiplication resulting in a total of n steps.
The analysis of recurrences may be more complicated than our simple example and
the use of more advanced techniques such as the Master theorem may be needed [7].
The fundamental classes of algorithms are the greedy algorithms, divide and conquer
algorithms, dynamic programming, and the graph algorithms. A greedy algorithm
makes the locally optimal choice at each step and these algorithms do not provide an
optimal solution in general but may find suboptimal solutions in reasonable times.
We will see a greedy algorithm that finds MST of a graph in reasonable time in graph
algorithms.
Divide and conquer algorithms on the other hand, divide the problem into a number
of subproblems and these subproblems are solved recursively, and the solutions of the
smaller subproblems may be combined to find the solution to the original problem.
As an example of this method, we will describe the mergesor t algorithm which
sorts elements of an array recursively. The array is divided into two parts, each part
is sorted recursively and the sorted smaller arrays are merged into a larger sorted
array. Merging is the key operation in this algorithm and when merging two smaller
sorted arrays, the larger array is formed by finding the smallest value found in both
smaller arrays iteratively. For example, merging the two sorted arrays {2, 8, 10, 12}
and {1, 4, 6, 9} results in {1, 2, 4, 6, 8, 9, 10, 12}. The dynamic programming method
and the graph algorithms have important applications in bioinformatics and they are
described next.
Algorithm 3.3 shows how Fibonacci sequence can be computed using dynamic
programming. The array F is filled with the n members of this sequence at the end
of the algorithm which requires Θ(n) steps. Dynamic programming is frequently
used in bioinformatics problems such as sequence alignment and DNA motif search
as we will see in Chaps. 6 and 8.
Theorem 3.1 The time complexity of BFS algorithm is Θ(n + m) for a graph of
order n and size m.
3.3 Algorithms 37
Proof The initialization between lines 3 and 6 takes Θ(n) time. The while loop is
executed at most n times and the f or loop between the lines 12 and 18 is run at most
deg(u)+1 times considering the vertices with no neighbors. Total running time in
this case is
(a) 1 2 3 (b) 1 12 5 8 6 7
b c d b c d
a a
g f e g f e
1 2 3 9 10 2 11 3 4
Fig. 3.6 a A BFS tree from vertex a. b A DFS tree from vertex a in the same graph. The first and
last visit times of a vertex are shown in teh left and right of a vertex consecutively
38 3 Graphs, Algorithms, and Complexity
There are two loops in the algorithm; the first loop considers each vertex at a
maximum of O(n) time and the second loop considers all neighbors of a vertex in
a total of u N (u) = 2m time. Time complexity of DFS algorithm is, therefore,
Θ(n + m).
Figure 3.7 shows the iterations of Prim’s algorithm in a sample graph. The com-
plexity of this algorithm is O(n 2 ) as the while loop is executed for all vertices and
the search for the minimum weight edge for each vertex will take O(n) time. This
complexity can be reduced to O(m log n) time using binary heaps and adjacency
lists.
(a) 1 (b) 1
a b a b
2 2
3 7 6 3 7
e 6 e
4 5 4 5
d c d c
8 8
(c) (d)
1
1 a b
a b 2
2
3 7
3 e 6
e 6
7 4 5
4 5 d c
d c
8
8
The execution of this algorithm is shown in Fig. 3.8 where shortest path tree T
is formed after four steps. Since there are n vertices, n − 1 edges are added to
the initial tree T consisting of the source vertex s only. The time complexity of
the algorithm depends on the data structures used. As with the Prim’s algorithm,
there are two nested loops and implementation using adjacency list requires O(n 2 )
time. Using adjacency lists and Fibonacci heaps, this complexity may be reduced to
O(m + n log n).
This algorithm only finds the shortest path tree from a source vertex and can be
executed for all vertices to find all shortest paths. It can be modified so that the
predecessors of vertices are stored (See Exercise 3.6).
3.3 Algorithms 41
(a) (b) 4
4
~ ~ 3 ~
9 b 9 c
b c
8 8
2 2 7
a 1 7 a 1
2 d 2 e d
e
6 6 ~
2 ~ 2
(c) (d)
4 4
3 4 3 4
9 b 9 c
b c
8 8
2 7 2 7
a 1 a 1
2 e d 2 e d
6 6
2 ~ 2 5
Given a simple and undirected graph G(V, E), a special subgraph G ⊆ G has some
property which may be implemented to discover a structure in biological networks.
The special subgraphs we will consider are the independent sets, dominating sets,
matching and the vertex cover.
(a) (b)
Fig. 3.9 a A maximum independent set of a graph. b The minimum dominating set of the same
graph
3.3.6.3 Matching
A matching in a graph G(V, E) is a subset E of its edges such that the edges in E
do not share any endpoints. Matching finds various applications including routing
in computer networks. A maximum matching of a graph G has the maximum size
among all matchings of G. A maximal matching of G is a matching in G which
cannot be enlarged. Finding maximum matching in a graph can be performed in
polynomial time [2] and hence is in P. A maximum matching of size 4 is depicted in
Fig. 3.10a.
(a) (b)
3.4 NP-Completeness
All of the algorithms we have considered up to this point have polynomial execu-
tion times which can be expressed as O(n k ) where n is the input size and k ≥ 0.
These algorithms are in complexity class P which contains algorithms with polyno-
mial execution times. Many algorithms, however, do not belong to P as either they
have exponential running times or sometimes the problem at hand does not even
have a known solution with either polynomial time or exponential time. Problems
with solutions in P are called tractable and any other problem outside P is called
intractable.
We need to define some concepts before investigating complexity classes other
than P. Problems we want to solve are either optimization problems where we try
to obtain the best solution to the problem at hand, or decision problems where we
try to find an answer in the form of a yes or no to a given instance of the problem
as described before. A cer ti f icate is an input combination for an algorithm and a
veri f ier (or a cer ti f ier ) is an algorithm that checks a certifier for a problem and
provides an answer in the form of a yes or no. We will exemplify these concepts by the
use of subset sum problem. Given a set of n numbers S = {i 1 , . . . , i n }, this problem
asks to find a subset of S sum of which is equal to a given value. For example, given
S = {3, 2, −5, −2, 9, 6, 1, 12} and the value 4, the subset {3, 2, −1} provides a yes
answer. The input {3, 2, −1} is the certificate and the verifier algorithm simply adds
the values of integers in the certifier in k − 1 steps where k is the size of the certifier.
Clearly, we need to check each subset of S to find a solution if it exits and this can
be accomplished in exponential time of 2n which is the number of subsets of a set
having n elements. Now, if we are given an input value m and a specific subset R of
S as a certifier, we can easily check in polynomial time using Θ(k) additions where
k is the size of R, whether R is a solution. The certifier is R and the verifier is the
algorithm that sums the elements of R and checks whether this sum equals m. Such
problems that can be verified with a nondeterministic random input in polynomial
time are said to be in complexity class Nondeterministic Polynomial (NP). Clearly,
all of the problems in P have verifiers that have polynomial running time and hence
P ⊆ N P. Whether P = NP is not known but the opposite is widely believed by
computer scientists.
NP-Complete problems constitute a subset of problems in NP and they are as
hard as any problem in NP. Formally, a decision problem C is NP-Complete if
it is in NP and every problem in NP can be reduced to C in polynomial time.
44 3 Graphs, Algorithms, and Complexity
NP−complete
NP
P
NP-hard problems are the problems which do not have any known polynomial time
algorithms and solving one NP-hard problem in polynomial time implies all of the
NP-hard problems that can be solved in polynomial time. In other words, NP-hard
is a class of problems that are as hard as any problem in NP. For example, finding the
least cost cycle in a weighted graph is an optimization problem which is NP-hard.
Figure 3.11 displays the relations between these complexity classes.
3.4.1 Reductions
(a) (b)
Fig. 3.12 Reduction from independent set to clique. a An independent set of a graph G. b The
clique of G using the same vertices
[5], our aim is to search for approximation algorithms for vertex cover. Algorithm
3.8 displays our approximation algorithm for the vertex cover problem. At each iter-
ation, a random edge (u, v) is picked, its endpoint vertices are included in the cover
and all edges incident to these edges are deleted from G. This process continues until
there are no more edges left and since all edges are covered, the output is a vertex
cover. We need to know how close the size of the output is to the optimum VC.
This algorithm in fact finds a maximal matching of a graph G and includes the
endpoints of edges found in the matching to the vertex cover set. Let M be a maximal
matching of G. The minimum vertex cover must include at least one endpoint of each
edge (u, v) ∈ M. Hence, the size of minimum vertex cover is at least as the size
of M. The size of the cover returned by the algorithm is 2|M| as both ends of the
matching are included in the cover. Therefore,
|V C| = 2|M| ≤ 2 × |MinV C| (3.8)
where VC is the vertex cover returned by the algorithm and MinVC is the minimum
vertex cover. Therefore, the approximation ratio of this algorithm is 2. The running
of this algorithm in a sample graph as shown in Fig. 3.13 results in a vertex cover
that has a size of 6 which is 1.5 times the optimum size of 4 shown in Fig. 3.13b.
(a) (b)
Fig. 3.13 Vertex Cover Examples. a The output of the approximation algorithm. b The optimal
vertex cover
3.4 NP-Completeness 47
We have reviewed the basic background to analyze biological sequences and bio-
logical networks in this chapter. Graph theory with its rich background provides a
convenient model for efficient analysis of biological networks. Our review of graphs
emphasized concepts that are relevant to biological networks rather than being com-
prehensive. We then provided a survey of algorithmic methods again with emphasis
on the ones used in bioinformatics. The dynamic programming and the graph algo-
rithms are frequently used in bioinformatics as we will see. Finally, we reviewed
the complexity classes P, NP, NP-hard, and NP-complete. Most of the problems we
encounter in bioinformatics are NP-hard which often require the use of approxima-
tion algorithms or heuristics. Since the size of data is huge in these applications,
we may use approximation algorithms or heuristics even if an algorithm in P exists
for the problem at hand. For example, if we have an exact algorithm that solves a
problem in O(n 2 ) time, we may be interested in an approximation algorithm with
a reasonable approximation ratio that finds the solution in O(n) time for n 1.
Similarly, the size of data being large necessitates the use of parallel and distrib-
uted algorithms which provide faster solutions than the sequential ones even if the
problem is in P, as we will see in the next chapter.
A comprehensive review of graph theory is provided by West [9] and Harary [6].
Cormen et al. [1], Skiena [8] and Levitin [7] all provide detailed descriptions of key
algorithm design concepts.
48 3 Graphs, Algorithms, and Complexity
Exercises
1. Find the adjacency matrix and the adjacency list representation of the graph
shown in Fig. 3.14.
2. Prove that each vertex of as graph G has an even degree if G has an Eularian
trail.
3. The exchange sort algorithm sorts the numbers in an array of size n by first
finding the maximum value of the array, swapping the maximum value with the
value in the first place, and then finding the maximum value of the remaining
(n −1) elements and continue until there are two elements. Write the pseudocode
of this algorithm; show its iteration steps in the array A = {3, 2, 5, 4, 6} and work
out its time complexity.
4. Modify the algorithm that finds the Fibonacci numbers (Algorithm 3.3) such
that only two memory locations which show the last two values of the sequence
are used.
5. Work out the BFS and DFS trees rooted at vertex b in the graph of Fig. 3.15 by
showing the parents of each vertex except the root.
6. Sort the weights of the graph of Fig. 3.16 in ascending order. Then, starting
from the lightest weight edge, include edges in the MST as long as they do
not form cycles with the existing edges in the MST fragment obtained so far.
This procedure is known as Kruskal’s algorithm. Write the pseudocode for this
algorithm and work out its time complexity.
1 2
6 3
5 4
a b c d
i j
h g f e
3 9
a b c
8 4
7 h
g 6 10
12
11 14
f e d
5 13
2
h g f e
7. Modify Dijkstra’s shortest path algorithm (Algorithm 3.7) so that the shortest
path tree information in terms of predecessors of vertices are stored.
8. Find the minimum dominating set and the minimum vertex cover of the graph
in Fig. 3.17.
9. Show that the independent set problem can be reduced to vertex cover problem
in polynomial time. (Hint: If V is an independent set of G(V, E), then V \ V
is a vertex cover).
References
1. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The
MIT Press, Cambridge
2. Edmonds J (1965) Paths, trees, and flowers. Can J Math 17:449–467
3. Erciyes K (2014) Complex networks: an algorithmic perspective. CRC Press, Taylor and Francis
4. Fiedler M (1989) Laplacian of graphs and algebraic connectivity. Comb Graph Theory 25:57–70
5. Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-
completeness. W. H. Freeman, New York
6. Harary F (1979) Graph theory. Addison-Wesley, Reading
50 3 Graphs, Algorithms, and Complexity
7. Levitin A (2011) Introduction to the design and analysis of algorithms, 3rd edn. Pearson Inter-
national Edn. ISBN: 0-321-36413-9
8. Skiena S (2008) The algorithm design manual. Springer, ISBN-10: 1849967202
9. West DB (2001) Introduction to graph theory, 2nd edn. Prentice-Hall, ISBN 0-13-014400-2
Parallel and Distributed Computing
4
4.1 Introduction
The terms computing, algorithm and programming are related to each other but
have conceptually different meanings. An algorithm in general is a set of instruc-
tions described frequently using pseudocode independent of the hardware, the oper-
ating system and the programming language used. Programming involves use of
implementation details such as operating system constructs and the programming
language. Computing is more general and includes methodologies, algorithms, pro-
gramming and architecture.
A sequential algorithm consists of a number of instructions executed consecu-
tively. This algorithm is executed on a central processing unit (CPU) of a computer
which also has memory and input/output units. A program is compiled and stored in
memory and each instruction of the program is fetched from the memory to the CPU
which decodes and executes it. In this so called Von Neumann model, both program
code and data are stored in external memory and need to be fetched to the CPU.
Parallel computing is the use of parallel computers to solve computational prob-
lems faster than a single computer. It is widely used for computationally difficult
and time consuming problems such as climate estimation, navigation and scientific
computing. A parallel algorithm executes simultaneously on a number of processing
elements with the processing elements being configured into various architectures.
A fundamental issue in parallel computing architecture is the organization of the
interconnection network which provides communication among tasks running on
different processing elements. The tasks of the parallel algorithm may or may not
share a commonly accessible global memory. The term parallel computing in general,
implies tightly coupling between the tasks as we will describe.
Distributed computing on the other hand, assumes the communication between
the tasks running on different nodes of the network communicate using messages
only. Distributed algorithms are sometimes called message-passing algorithms for
this reason. A network algorithm is a type of distributed algorithm where each node
is typically aware of its position in the network topology; cooperates with neighbor
© Springer International Publishing Switzerland 2015 51
K. Erciyes, Distributed and Sequential Algorithms for Bioinformatics,
Computational Biology 23, DOI 10.1007/978-3-319-24966-7_4
52 4 Parallel and Distributed Computing
nodes to solve a network problem. For example, a node would search for shortest
paths from itself to all other nodes by cooperating with its neighbors in a network
routing algorithm. In a distributed memory routing algorithm on the other hand,
we would attempt to solve shortest paths between all nodes of a possibly large net-
work graph using few processing elements in parallel. Both can be called distributed
algorithms in the general sense.
We start this chapter by the review of fundamental parallel and distributed com-
puting architectures. We then describe the needed system support for shared-memory
parallel computing and review multi-threaded programming by providing examples
on POSIX threads. The parallel algorithm design techniques are also outlined and the
distributed computing section is about message passing paradigm and examples of
commonly used message passing software are described. We conclude the analysis
by the UNIX operating system and its network support for distributed processing
over a computer network.
(a) (b)
P1 P2 ... PN
Fig. 4.1 Sample interconnection networks with each processor having switching capabilities. a A
shared medium. b A 2-D mesh. Sample concurrent paths are shown in bold in (b)
(a) (b)
6 7
2 3
4 5
0 1
Fig. 4.2 Sample interconnection networks. a A 3-dimension hypercube. b A binary tree network.
Circles are the switching elements, squares are the processors. Three sample concurrent paths are
shown in bold in both figures
(a) (b)
P1 P2 ... PN P1 PN
Shared Input /
Memory Output
is generally more tightly coupled and more homogeneous than a distributed system
although technically, a distributed system is a multicomputer system as it does not
have shared memory.
employed within the CPU using instruction level parallelism (ILP) by executing
independent instructions at the same time, however, the physical limits of this method
have been reached recently and parallel processing at a coarser level has re-gained
its popularity among researchers.
A parallel algorithm is analyzed in terms of its time, processor and work complexities.
The time complexity T (n) of a parallel algorithm is the number of time steps required
to finish it. The processor complexity P(n) specifies the number of processors needed
and the work complexity W (n) is the total work done by all processors where W =
P × T . A parallel algorithm for a problem A is more efficient than another parallel
algorithm B for the same problem if WA < WB .
Let us consider a sequential algorithm that solves a problem A with a worst case
running time of Ts (n) which is also an upper bound for A. Furthermore, let us assume
a parallel algorithm does W (n) work to solve the same problem in Tp (n) time. The
parallel algorithm is work-optimal if W (n) = O(T (n)). The speedup Sp obtained by
a parallel algorithm is the ratio of the sequential time to parallel time as follows:
Ts (n)
Sp = (4.1)
Tp (n)
and the efficiency Ep is,
Sp
Ep = (4.2)
p
which is a value between zero and one, with p being the number of processors. Ideally
Sp should be equal to the number of processors but this will not be possible due to
the overheads involved in communication among parallel processes. Therefore, we
need to minimize the interprocess communication costs to improve speedup. Placing
many processes on the same processor decreases interprocess communication costs
as local communication costs are negligible, however, the parallelism achieved will
be greatly reduced. Load balancing of parallel processes involves distributing the
processes to the processors such that the computational load is balanced across the
processors and the inter-processor communications are minimized.
This model is commonly used for parallel algorithm design as it discards various
implementation details such as communication and synchronization. It can further
be classified into the following subgroups:
• Exclusive read exclusive write (EREW): Every memory cell can be read or written
by only one processor at a time.
• Concurrent read exclusive write (CREW): Multiple processors can read a memory
cell but only one can write at a time.
• Concurrent read concurrent write (CRCW): Multiple processors can read and write
to the same memory location concurrently. A CRCW PRAM is sometimes called
a concurrent random-access machine.
Exclusive read concurrent write (ERCW) is not considered as it does not make
sense. Most algorithms for PRAM model use SIMD model of parallel computation
and the complexity of a PRAM algorithm is the number of synchronous steps of the
algorithm.
A[0:15] P0
A[0:7] A[8:15]
P0 P1
A[0:3] A[4:7] A[8:11] A[12:15]
P0 P1 P2 P3
A[2:3] A[4:5] A[6:7] A[8:9] A[10:11]
A[0:1] A[14:15]
P0 P1 P2 P3 P4 P5 P6 P7
A[12:13]
0 1 3 2 . . . 15
A
Algorithm 4.1 shows one way of implementing this algorithm using EREW nota-
tion where the active processors are halved in each step for a total of logn steps. We
could modify this algorithm so that contents of the array A are not altered by starting
with 2p processor which copy A to another array B (see Exercise 4.2).
The time complexity of this algorithm is log n and in the first step, each processor
pi performs one addition for a total of n/2 additions. In the second step, we have
n/4 processors each performing 2 additions and it can readily be seen that we have a
total of n/4 summations at each step. The total work done in this case is (n log n)/4).
We could have a sequential algorithm that finds the result in only n − 1 steps. In this
case, we find the work done by this algorithm is not optimal.
We will assume that a parallel program consists of tasks that can run in parallel.
A task in this sense has code, local memory and a number of input/output ports.
Tasks communicate by sending data to their output ports and receive data from their
input ports [12]. The message queue that connects an output port of a task to an
input port of another task is called a channel. A task that wants to receive data
from one of its input channels is blocked if there is no data available in that port.
However, a task sending data to an output channel does not usually need to wait for
the reception of this data by the receiving task. This model is commonly adopted
in parallel algorithm design where reception of data is synchronous and the sending
of it is asynchronous. Full synchronous communication between two tasks requires
the sending task to be blocked also until the receiving task has received data. A
four-step design process for parallel computing was proposed by Foster consisting
of partitioning, communication, agglomeration and mapping phases [5].
Partitioning step can be performed on data, computation or both. We may divide
data into a number of partitions that may be processed by a number of processing
elements which is called domain decomposition [12]. For example, if the sum of an
array of size n using p processors is needed, we can allocate each processor n/p data
58 4 Parallel and Distributed Computing
items which can perform addition of these elements and we can then transfer all of
the partial sums to a single processor which computes the total sum for output. In
partitioning computation, we divide the computation into a number of modules and
then associate data with each module to be executed on each processor. This strategy
is known as functional decomposition and sometimes we may employ both domain
and functional decompositions for the same problem that needs to be parallelized.
The communication step involves the deciding process of the communication
among the tasks such as which task sends or receives data from which other tasks.
The communication costs among tasks that reside on the same processor can be
ignored, however, the interprocess communication costs using the interconnection
network is not trivial and needs to be considered.
The agglomeration step deals with grouping tasks that were designed in the first
two steps so that the interprocess communication is reduced. The final step is the
allocation of the task groups to processors which is commonly known as the mapping
problem. There is a trade off however, as grouping tasks onto the same processor
decreases communication among them but results in less parallelism. One way of
dealing with this problem is to use a task dependency graph which is a directed graph
representing the precedences and communications of the tasks. The problem is then
reduced to the graph partitioning problem in which a vast amount of literature exists
[1]. Our aim in general is to find a minimum weight cut set of the task graph such
that communication between tasks on different processors is minimized and the load
among processors is balanced.
Figure 4.5 shows such a task dependency graph where each task has a known
computation time and the costs of communications are also shown. We assumed the
PA1 PA2
1 task id
3 execution time
2 2 1
2 3
3 6 2
4 12
4 5 6
2 3 7 8
1
7
1 7
2
4
4
8
5
PA1
P2 1 3 6
8
P1 2 4 5 7
5 10 t 20 30 40
PA2
P2 3 5 6
P1 1 2 4 7 8
5 10 t 20 30 40
computation times of tasks are known beforehand which is not realistic in a general
computing system but is valid for many real-time computing systems. As a general
rule, a task Ti that receives some data from its predecessors cannot finish before
its predecessors finish, simply because we need to make sure that all of its data has
arrived before it can start its execution. Otherwise, we cannot claim that Ti will finish
in its declared execution time.
We can partition these tasks to two processors P1 and P2 with partition PA1
where we only consider the number of tasks in each partition resulting in a total
IPC cost of 21; or PA2 in which we consider IPC costs with a total cost of 8 for
IPC. These schedules are displayed by the Gantt charts [6] which show the start and
finishing times of tasks in Fig. 4.6. The resulting total execution times are 41 and 32
for partitions PA1 and PA2 from which we can deduce PA2 which has a lower IPC
cost and shorter overall execution time is relatively more favorable.
schedule
READY RUN
time expires
it is waiting. When that event occurs, it is made ready again to be scheduled by the
operating system. A running process may have its time slice expired in which case it
is preempted and put in the ready state to be scheduled when its turn comes. The data
about a process is kept in the structure called process control block (PCB) managed
by the operating system.
Having processes in an operating system provides modularity and a level of par-
allelism where CPU is not kept idle by scheduling a ready process when the cur-
rent process is blocked. However, we need to provide some mechanisms for these
processes to synchronize and communicate so that they can cooperate. There are
two types of synchronization among processes; mutual exclusion and conditional
synchronization as described next.
P1 P2
1. LOAD R2,m[x] 1. LOAD R5,m[x]
2. INC R2 2. INC R5
3. STO m[x],R2 3. STO m[x],R5
Now, let us assume P1 executes first, and just before it can execute line 3, its
time slice expires and the operating system performs a context switch by storing its
variables including R2 which has 13 in its PCB. P2 is executed afterwards which
4.3 Parallel Computing 61
executes by reading the value of 12 into register R15, incrementing R5 and storing
its value 13 in x. The operating system now schedules P1 by loading CPU registers
from its PCB and P1 writes 13 on x location which already contains 13. We have
incremented x once instead of twice as required.
As this example shows, some parts of the code of a process has to be executed
exclusively. In the above example, if we had provided some means so that P1 and P2
executed lines 1–3 without any interruption, the value of x would be consistent. Such
a segment of the code of a process which has to be executed exclusively is called a
critical section. A simple solution to this problem would be disabling of interrupts
before line 1 and then enabling them after line 3 for each process. In user C code
for example, we would need to enclose the statement x = x + 1 by the assembly
language codes disable interrupt (DI) and enable interrupt (EI). However, allowing
machine control at user level certainly has shortcomings, for example, if the user
forgets enabling interrupts then the computation would stop completely.
There are various methods to provide mutual exclusion while a critical section
is executed by a process and the reader is referred to [16] for a comprehensive
review. Basically, we can classify the methods that provide mutual exclusion at
hardware, operating system or application (algorithmic) level. At hardware level,
special instructions that provide mutual exclusion to the memory locations are used
whereas algorithms provide mutual exclusion at application level. However, it is not
practical to expect a user to implement algorithms when executing a critical section,
instead, operating system primitives for this task can be appropriately used. Modern
operating systems provide a mutual exclusion data structure (mutex) for each critical
section and two operations called lock and unlock on the mutex structure. The critical
section can then be implemented by each process on a shared variable x as follows:
mutex m1;
int x /* shared integer */
process P1{
... /* non-critical section */
lock(&m1);
x++; /* critical section */
unlock(&m1);
... /* non-critical section */
}
The lock and unlock operations on the mutex variables are atomic, that is, they
cannot be interrupted, as provided by the operating system.
A semaphore can be used for both mutual exclusion and conditional synchroniza-
tion. The following example demonstrates these two usages of semaphores where two
processes producer and consumer synchronize using semaphores. The semaphore
s1 is used for mutual exclusion and the semaphores s2 and s3 are used for waiting
and signalling events.
sem_t s1,s2,s3;
int shared;
wait(s1); out_data=shared;
shared = in_data; signal(s1);
signal(s1); signal(s2);
signal(s3); print out_dat;
} }
} }
The shared memory location which should be mutually accessed is shared and
is protected by the semaphore s1. The producer continuously reads data from the
keyboard, and first waits on s2 to be signalled by the consumer indicating it has read
the previous data. It then writes the input data to the shared location and signals
the consumer by the s3 semaphore so that it can read shared and display the data.
Without this synchronization, the producer may overwrite previous data before it is
read by the consumer; or the consumer may read the same data more than once.
A thread is a lightweight process with an own program counter, stack and register
values. It does not have the memory page references, file pointers and other data
that an ordinary process has, and each thread belongs to one process. Two different
types of threads are the user level threads and the kernel level threads. User level
threads are managed by the run-time system at application level and the kernel which
is the core of an operating system that handles basic functions such as interprocess
communication and synchronization, is not aware of the existence of these threads.
Creation and other management functions of a kernel thread is performed by the
kernel.
A user level thread should be non-blocking as a single thread of a process that
does a blocking system call will block all of the process as the kernel sees it as
one main thread. The kernel level threads however can be blocking and even if one
thread of a process is blocked, kernel may schedule another one as it manages each
thread separately by using thread control blocks which store all thread related data.
The kernel threads are a magnitude or more times slower than user level threads due
to their management overhead in the kernel. The general rule is to use user level
threads if they are known to be non-blocking and use kernel level threads if they are
blocking, at the expense of slowed down execution. Figure 4.8 displays the user level
and kernel level threads in relation to the kernel.
Using threads provide the means to run them on multiprocessors or multi-core
CPUs by suitable scheduling policies. Also, they can share data which means they do
not need interprocess communication. However, the shared data must be protected
by issuing operating system calls as we have seen.
64 4 Parallel and Distributed Computing
(a) (b)
Fig. 4.8 a User level threads. b Kernel level threads which are attached to thread control blocks in
the kernel
where thread_id is the variable where the created thread identifier will be stored after
this system call, start_function is the address of the thread code and the arguments
are the variables passed to the created thread.
The following example illustrates the use of threads for parallel summing of an
integer array A with n elements. Each thread sums the portion of the array defined by
its identity passed to it during its creation. Note that we are invoking the same thread
code with different parameter (i) each time resulting in summing a different part of
the array A, for example, thread 8 sums A[80 : 89]. For this example, we could have
used user threads as they work independently without getting blocked, however, we
used kernel threads as POSIX threads API allows the usage of kernel threads only.
Solaris operating system allows both user and kernel threads and provide flexibility
but this API is not a standard [15].
#include <stdio.h>
#include <pthread.h>
#define n 100
#define n_threads 10
int A[n]=2,1,...,8; /* initialize */
pthread_mutex_t m1;
4.3 Parallel Computing 65
int total_sum;
main()
{ pthread_t threads[n];
int i;
pthread_mutex_init(&m1);
for(i=1; i<=n_threads; i++)
pthread_create(&threads[i],NULL,worker,i);
for(i=1; i<=n_threads; i++)
pthread_join(threads[i],NULL);
printf("Total sum is = %d", total_sum);
}
We need to compile this program (sum.c) with the POSIX thread library as follows:
#include <semaphore.h>
sem_t s1;
sem_init(&s1, 1, 1)
The wait and signal operations on semaphores are sem_wait and sem_signal
respectively. In the following example, we will implement the producer/consumer
example of Sect. 4.3.4 by using two threads and two semaphores, full and empty.
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>
scanf("%d", data);
sem_post(&full);
i++;
}
}
main()
{ pthread_t prod, cons;
int i;
sem_init(&full,1,0);
sem_init(&empty,1,1);
pthread_create(&prod,NULL,producer,NULL);
pthread_create(&cons,NULL,consumer,NULL);
pthread_join(prod,NULL);
pthread_join(cons,NULL);
}
UNIX is a multitasking operating system developed at Bell Labs first and at Uni-
versity of California at Berkeley with network interface extension, named Berkeley
Software Distribution (BSD). It is written in C for the most part, has a relatively
smaller size than other operating systems, is modular and distributed in source code
which make UNIX as one of the most widely used operating systems. The user
interface in UNIX is called shell, and the kernel which performs the core operating
system functions such as process management and scheduling resides between the
shell and the hardware as shown in Fig. 4.9.
4.3 Parallel Computing 67
Shell
Kernel
Hardware
UNIX is based on processes and a number of system calls provide process man-
agement. A process can be created by a process using the system call fork. The caller
becomes the parent of the newly created process which becomes the child of the
parent. The fork system call returns the process identifier as an unsigned integer to
the parent, and 0 to the child. We can create a number of parallel processes using this
structure and provide concurrency where each child process is created with a copy
of the data area of the parent and it runs the same code as the parent. UNIX provides
various interprocess communication primitives and pipes are one of the simplest. A
pipe is created by the pipe system call and two identifiers, one for reading from a
pipe and one for writing to the pipe are returned. Reading and writing to a pipe are
performed by read and write system calls, similar to file read and write operations.
The following describes the process creation and interprocess communication using
pipes in UNIX. Our aim in this program to add the elements of an array in parallel
using two processes. The parent creates two pipes one for each direction and then
forks a child, sends the second half of the array to it by the pipe p1 to have it calculate
the partial sum, and calculates the sum of the lower portion of the array itself. It then
reads the sum of the child from pipe p2 and finds the total sum and displays it.
#include <stdio.h>
#define n 8
int c, *pt, p1[2], p2[2], A[n], sum=0, my_sum=0;
main()
{ pipe(p1); /* create two pipes one for each direction */
pipe(p2); /* p1 is from parent to child, */
c=fork(); /* p2 is from child to parent */
if(c!=0) { /* this is parent */
close(p1[0]); /* close the read end of p1 */
close(p2[1]); /* close the write end of p2 */
for(i=0;i<n;i++) /* initialize array */
A[i]=i+1;
write(p[1],&A[n/2],n/2); /* send the second half of array */
68 4 Parallel and Distributed Computing
for(i=0;i<n/2;i++)
my_sum=my_sum+A[i];
while(n=read(p2[0], pt, 1));
printf("Total sum is = %d", total_sum);
}
else { /* this is child */
close(p2[0]); /* close the read end of p2 */
close(p1[1]); /* close the write end of p1 */
while(n=read(p2[0], pt, n));
for(i=0;i<n/2;i++)
my_sum=my_sum+A[i];
write(p2[1],sizeof(int),sum); /* send partial sum */
}
}
}
Although threads are mostly used for shared memory parallel processing, we can still
have message-passing functionality using threads by additional routines at user level.
One such library is presented in [2] where threads communicate by message passing
primitives write_fifo and read_fifo which correspond to send and receive routines
of a message passing system. The communication channels between processes are
simulated by first-in-first-out (FIFO) data structures as shown below.
typedef struct {
sem_t send_sem;
sem_t receive_sem;
msgptr_t message_que[N_msgs];
} fifo_t;
and signalling its receiving semaphore so that it is activated. Using such a simulator
provides a simple testbed to verify distributed algorithms and when the algorithm
works correctly, performance tests can be performed using commonly used message-
passing tools as described next.
The Message Passing Interface Standard (MPI) provides a library of message passing
primitives in C or Fortran programming languages for distributed processing over a
number of computational nodes connected by a network [8,10]. Its aim is to provide
a portable, efficient and flexible standard of message passing. MPI code can be eas-
ily transferred to any parallel/distributed architecture that implements MPI. It can
be used in parallel computers, clusters and heterogeneous networks. It is basically
an application programming interface (API) to handle communication and synchro-
nization among processes residing at various nodes of a paralell/distributed comput-
ing system. MPI provides point-to-point, blocking and non-blocking and collective
communication modes. A previous commonly used tool for this purpose was Parallel
Virtual Machine (PVM) [7]. The basic MPI instructions to run an MPI program are
as follows.
#include <stdio.h>
#include <mpi.h>
The program when run by the mpirun command takes the number of processes as
an input which is 8 in this case. It then creates 7 other child processes by the MPI_Init
command to have a total of 8 processes. Each of these processes then execute their
4.4 Distributed Computing 71
program separately. We will have 8 “Hello world” outputs when this program is
executed. Clearly, it would make more sense to have each child process execute on
different data to perform parallel processing. In order to achieve this SPMD style
of parallel programming, each process has a unique identifier which can then be
mapped to the portion of data it should process. A process finds out its identifier by
the MPI_Comm_rank command and based on this value, it can execute the same code
on different data, achieving data partitioning. Two basic communication routines in
MPI are the MPI_Send and MPI_Receive with the following syntax:
where data is the memory location for data storage, n_data is the number of data
items to be transferred with the given type, receiver and the sender are the identifiers
of the receiving and the sending processes respectively, and tag is the type of message.
The next example shows how to find the sum of an array in parallel using a number
of processes. In this case, the root performs data partitioning by sending a different
part of an array to each child process which calculate the partial sums and return it
to the root which finds the total sum. The root first initializes the array and sends
the size of the portion along with array data to each process. Message types for each
communication direction are also defined.
#include <stdio.h>
#include <mpi.h>
MPI_Init(&argc, &argv);
array[i]=i;
}
n_portion=n_data/n_procs;
MPI_Recv( &array, n_portion, MPI_INT, root, tag2,
MPI_COMM_WORLD, &status);
partial_sum = 0;
for(i = 0; i < n_portion; i++)
partial_sum += array[i];
BSD UNIX provides the socket-based communications over the Internet. A server
is an endpoint of a communication between the two hosts over the Internet which
performs some function as required by a client process in network communica-
tions using sockets. Communication over the Internet can be performed either as
connection-oriented where a connection between a server and a client is established
and maintained during data transfer. Connection oriented delivery of messages also
guarantees the delivery of messages reliably by preserving the order of messages.
Transmission Control Protocol (TCP) is a connection oriented protocol provided at
layer 4 of the International Standards Organization (ISO) Open System Interconnect
(OSI) 7-Layer model. In connectionless communication mode, there is no established
or maintained connection and the delivery of messages is not guaranteed requiring
further processing at higher levels of communication. Unreliable Datagram Protocol
(UDP) is the standard connectionless protocol of the OSI model. Sockets may be
used for connection-oriented communications in which case they are called stream
sockets and when they are used for connectionless communication, they are called
datagram sockets. The basic system calls provided by BSD UNIX are as follows:
• socket: This call creates a socket of required type, whether connection-oriented or
connectionless, and the required domain.
• listen: A server specifies the maximum number of concurrent requests it can handle
on a socket by this call.
• connect: A client initiates a connection to a server by this procedure.
• accept: A server accepts client requests by this call. It blocks the caller until a
request is made. This call is used by a server for connection oriented communica-
tion.
• read and write: Data is read from and written to a socket using these calls.
• close: A connection is closed by this call on a socket.
Figure 4.10 displays a typical communication scenario of a connection-oriented
server and a client. The server and the client are synchronized by the blocking system
calls connect and accept after which a full connection is established using the TCP
protocol. The server then reads the client request by the read and responds by the
write calls. A server typically runs in an endless loop and in order to respond to
further incoming requests, it spawns a child for each request so that it can return
waiting for further client requests on the accept call. Threads may also be used for
this purpose resulting in less management overheads than processes.
We can use socket-based communication over the Internet to perform distributed
computations. Returning to the sum of an array example we have been elaborating,
a server process can send portions of an array to a number of clients using stream or
datagram sockets upon their requests and then can combine all of the partial results
from the clients to find the final sum (see Exercise 4.9).
74 4 Parallel and Distributed Computing
Server Client
socket
socket
bind
connect
listen
write
accept
read
read
close
write
face provides routines for creation, reading from and writing to data structures called
sockets. However, MPI provides a neater interface to the application with the addition
of routines for multicast and broadcast communication and is frequently employed
for parallel and distributed processing. Threads are frequently used for shared mem-
ory parallel processing and MPI can be employed for both parallel and distributed
applications.
Exercises
1. Provide a partitioning of the task dependency graph of Fig. 4.5 to three processors
which considers evenly balancing the load among the processors and minimizing
the IPC costs at the same time. Draw the Gant chart for this partitioning and work
out the total IPC cost and the execution time of this partitioning.
2. Modify Algorithm 1.1 such that array A is first copied to an array B and all of
the summations are done on array B to keep contents of A intact. Copying of A
to B should also be done in parallel using n processors.
3. The prefix sum of an array A is an array B with the running sums of the elements
of A such that b0 = a0 ; b1 = a0 + a1 ; b2 = a0 + a1 + a2 ; . . . For example if
A = {2, 1, −4, 6, 3, 0, 5} then B = {2, 3, −1, 5, 8, 8, 13}. Write the pseudocode
of an algorithm using EREW PRAM model to find the prefix sum of an input
array.
4. Provide an N-buffer version of the producer/consumer algorithm where we have
N empty buffers initially and the producer process fills these buffers with input
data as long as there are empty buffers. The producer reads filled buffer loca-
tions and consumes these data by printing them. Write the pseudocode of this
algorithm with short comments.
5. A multi-threaded file server is to be designed which receives a message by the
front end thread, and this thread invokes one of the open, read and write threads
depending on the action required in the incoming message. The read thread reads
the number of bytes from the file which is sent to the sender, the write thread
writes the specified number of contained bytes in the message to the specified
file. Write this program in C with POSIX threads with brief comments.
1 4
6. The PI number can be approximated by 0 (1+x 2 ) . Provide a multi-threaded C
program using POSIX threads, with 10 threads each working with 20 slices to
find PI by calculating the area under this curve.
7. Modify the program of Sect. 4.3.6 such that there are a total of 8 processes
spawned by the parent to find the sum of the array in parallel.
8. Provide a distributed algorithm for 8 processes with process 0 as the root. The
root process sends a request to learn the identifiers of the processes in the first
round and then finds the maximum identifier among them and notifies each
process of the maximum identifier in the second round.
9. Write an MPI program in C that sums elements of an integer array in a pipelined
topology of four processes as depicted in Fig. 4.11. The root process P0 computes
the sum of the first 16 elements and passes this sum along with the remaining
48 elements to P1 which sums the second 16 elements of the array and pass
76 4 Parallel and Distributed Computing
16 ... 64 32 ... 64
P1
P0 P2
48 ... 64
P3
sum 0−3
the accumulated sum to P2. This process is repeated until message reaches P0
which receives the total sum and displays it in the final step. Note that the first
extra element of the message transferred can be used to store the partial sum
between processes.
10. Provide the pseudocode of a program using UNIX BSD sockets where we have
a server and n clients. Each client makes a connection-oriented call to the server,
obtains its portion of an integer array A and returns the partial sum to the server.
The server displays the total sum of array A when it receives all of the partial
sums from the clients.
Part II
Biological Sequences
String Algorithms
5
5.1 Introduction
A string S consists of an ordered sequence of characters over a finite set of symbols called
an alphabet Σ. The size of the alphabet, |Σ|, is the number of distinct characters
in it, and the size of a string, |S|, is the number of characters it contains, also
called the length of the string. For example, the DNA structure can be represented by
a string over the alphabet of four nucleotides: Adenine (A), Cytosine (C), Guanine
(G) and Thymine (T), hence Σ = {A, C, G, T } for DNA; the RNA structure has
Σ = {A, C, G, U} replacing Thymine with Uracil (U), and a protein is a linear chain
of amino acids over an alphabet of 20 amino acids Σ = {A, R, N, . . . , V }.
String algorithms have numerous applications in molecular biology. Biologists
often need to compare two DNA/RNA or protein sequences to find the similarity
between them. This process is called sequence alignment and finding the relatedness
of two or more biological sequences helps to deduce ancestral relationships among
organisms as well as finding conserved segments which may have fundamental roles
for the functioning of an organism. We may also need to search for a specific string
pattern in a biological sequence as this pattern may indicate a location of a structure
such as a gene in DNA. In some other cases, biologists need to find repeating substring
patterns in DNA or protein sequences, as these may serve as signals in the genome,
and their over-representation may indicate complex diseases. Analysis of the genome
in terms of substring rearrangements helps to find evolutionary relationships between organisms
protein functionalities are highly dependent on their sequence structures and we
need efficient string manipulation algorithms to understand the underlying complex
biological processes.
In this chapter, we start the analysis of strings by the string matching algorithms
where we search the occurrences of a small string inside a larger string. We will
see that there are a number of algorithms with varying time complexities. The next
problem we will investigate is the longest common subsequence problem, where our
aim is to find the longest common subsequence of two strings, together with the
related longest increasing subsequence problem.
We will now review fundamental sequential algorithms for string matching, starting
with the naive algorithm, a brute-force method that checks every possible position.
As this is a time-consuming approach, various algorithms with lower time
complexities have been proposed.
In the naive approach, we start from the first locations in T and P and compare their elements. As long as we find
a match, the indices are incremented and if the size of the pattern m is reached, a
match is found and output as shown in Algorithm 5.1.
Let us consider the running of this algorithm until the first match for the following example:
T = a b c c b a b a c c b a a b
P = c c b a
c c b a
c c b a (a match found at location 3)
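This brute-force scan translates directly into C. The sketch below uses illustrative names and 0-based indexing in memory, with locations printed 1-based as in the text, and reports every occurrence rather than only the first:

#include <stdio.h>
#include <string.h>

/* Report every occurrence of pattern P (length m) in text T (length n). */
void naive_match(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P);
    for (int i = 0; i <= n - m; i++) {      /* each candidate position   */
        int k = 0;
        while (k < m && T[i + k] == P[k])   /* compare up to m characters */
            k++;
        if (k == m)
            printf("match at location %d\n", i + 1);
    }
}

int main(void) {
    naive_match("abccbabaccbaab", "ccba");  /* the example above */
    return 0;
}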
The time taken by this algorithm is O(nm) since checking each occurrence of P
is done in m comparisons and we need to check the first n − m + 1 locations of T .
For very long strings such as the biological sequences, the total time taken will be
significantly high. Baeza-Yates showed that the expected time taken in this algorithm
for an alphabet of size c is [1]:
$C = \frac{c}{c-1}\left(1 - \frac{1}{c^m}\right)(n - m + 1) + O(1)$    (5.1)
String matching can also be performed by a finite state machine (FSM) defined by the following components:
• Σ is an input alphabet
• S is a finite nonempty set of states
• s0 ∈ S is the initial state
• δ : S × Σ → S is the state transition function
Each state transition can be defined as a function whose address is placed in a transition
table. The algorithm then simply directs the flow to the function specified by the
current state and the current input in the table. The states are now labeled as 0, 1, 2, 3 and 4, and
the inputs are 0, 1, and 2 corresponding to a, b, and c. The time to construct the FSM
is Θ(m|Σ|) which is the size of the table, assuming there are no overlapping states.
The time to search the input string is Θ(n), resulting in a total time complexity of
Θ(m|Σ| + n) for this algorithm.
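A sketch of this table-driven idea in C is given below, assuming the alphabet {a, b, c} mapped to 0–2. For brevity the transition table is built by a straightforward method rather than the Θ(m|Σ|) construction mentioned above, and a plain state-transition table stands in for the function-address table:

#include <stdio.h>
#include <string.h>

#define SIGMA 3                      /* alphabet {a, b, c} mapped to 0-2 */

static int delta[16][SIGMA];         /* transition table; pattern length < 16 */

/* delta[q][c] = length of the longest prefix of P that is a suffix of
   P[0..q-1] followed by character c. */
static void build_fsm(const char *P, int m) {
    for (int q = 0; q <= m; q++)
        for (int c = 0; c < SIGMA; c++) {
            int k = q + 1 < m ? q + 1 : m;        /* k <= min(m, q+1)   */
            while (k > 0) {
                if (P[k - 1] == 'a' + c) {        /* last char must be c */
                    int ok = 1;
                    for (int i = 0; i < k - 1; i++)
                        if (P[i] != P[q - (k - 1) + i]) { ok = 0; break; }
                    if (ok) break;
                }
                k--;
            }
            delta[q][c] = k;
        }
}

void fsm_match(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P), q = 0;
    build_fsm(P, m);                 /* table of size Theta(m|Sigma|)        */
    for (int i = 0; i < n; i++) {    /* Theta(n) scan: one transition/symbol */
        q = delta[q][T[i] - 'a'];
        if (q == m)
            printf("match ending at location %d\n", i + 1);
    }
}

int main(void) {
    fsm_match("abccbabaccbaab", "ccba");   /* the running example */
    return 0;
}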
Fig. 5.2 The computation of the prefix array Π for the pattern P = {acacaba}
The formation of the prefix table Π is performed at initialization using the pattern
P only. The idea here is to check incrementally whether any proper prefix of the pattern
is also a proper suffix of it, and if so, how long such a prefix is. The
entry Π[i] is the largest integer smaller than i such that the prefix p1 . . . pΠ[i] is also a
suffix of p1 . . . pi. As an example, let us assume the pattern P = {acacaba} and
work out the values of Π as shown in Fig. 5.2.
The pseudocode of the prefix function to fill Π is shown in Algorithm 5.2.
Once this array is formed, we can use it when there is a mismatch to shift the
indices in the text T . The operation of the KMP algorithm is described in pseudocode
in Algorithm 5.3.
Prefix : 0 0 1 2 3 0 1
S = c c a c a c a c a b a c a b
P = a c a c a b a (first match found at location 3)
a c a c a b a (second match)
a c a c a b a (third match)
a c a c a b a (fourth match)
a c a c a b a (fifth match)
a c a c a b a (mismatch at 6, shift 1)
a c a c a b a (mismatches at 1,2 and 3, shift 1 at 3)
a c a c a b a (first match found at location 3)
a c a c a b a (second match)
a c a c a b a (third match)
a c a c a b a (fourth match)
a c a c a b a (fifth match)
a c a c a b a (sixth match)
a c a c a b a (FULL MATCH)
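The prefix computation and the search loop can be written compactly in C, as sketched below. Here P and T are 0-based in memory while the Π array keeps the 1-based convention of the text, and the pattern length is assumed to be under 64:

#include <stdio.h>
#include <string.h>

/* Fill pi[1..m]: pi[i] = length of the longest proper prefix of P
   that is also a suffix of P[1..i] (1-based, as in the text). */
void prefix_table(const char *P, int m, int *pi) {
    int k = 0;
    pi[1] = 0;
    for (int i = 2; i <= m; i++) {
        while (k > 0 && P[k] != P[i - 1])  /* fall back through prefixes */
            k = pi[k];
        if (P[k] == P[i - 1])
            k++;
        pi[i] = k;
    }
}

/* KMP search: report all occurrences of P in T. */
void kmp_match(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P), pi[64], q = 0;
    prefix_table(P, m, pi);
    for (int i = 0; i < n; i++) {
        while (q > 0 && P[q] != T[i])
            q = pi[q];                     /* mismatch: shift via table */
        if (P[q] == T[i])
            q++;                           /* match: advance in pattern */
        if (q == m) {
            printf("full match ending at %d\n", i + 1);
            q = pi[q];                     /* continue for further matches */
        }
    }
}

int main(void) {
    kmp_match("ccacacacaba", "acacaba");   /* the trace above */
    return 0;
}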
This algorithm takes O(m) time to compute the prefix values and O(n) to compare
the pattern to the text, resulting in a total of O(m + n) time. The running time of
this algorithm is optimal and can process large texts as it never needs to move
backward. However, it is sensitive to the size of the alphabet |Σ| as there will be
more mismatches when this increases.
T= a b c c b a a b b a c
P= a c b c b a mismatch at position 3
i=3, k=3, T[3]=c, i-R[c]=-1, shift 1
b c c b c | b a a c b | b c a b b
p1 p2 p3
Neither the process p2 nor p1 will be aware of the match because they do not
have the whole of the pattern cba. A possible remedy for this situation would be to
provide the preceding process, which is p1 in this case, with the first m − 1 characters
of its succeeding process. Searching for matches in these bordering regions would
then be the responsibility of the preceding process with the lower identifier. Assuming
the communication costs are negligible, this approach which is sometimes called
embarrassingly parallel will provide almost linear speedup.
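The boundary handling can be sketched as follows in C: each of the p blocks, except the last, is extended by m − 1 characters into its successor so that every match is found by exactly one process. This is a sequential simulation; in a real distributed setting each process would scan only its own block:

#include <stdio.h>
#include <string.h>

/* Each of p "processes" searches its block of T for P; every block except
   the last is extended by m-1 characters so boundary matches are caught. */
void partitioned_match(const char *T, const char *P, int p) {
    int n = strlen(T), m = strlen(P), block = n / p;
    for (int r = 0; r < p; r++) {            /* r plays the role of a rank */
        int start = r * block;
        int end = (r == p - 1) ? n : (r + 1) * block + m - 1;
        if (end > n) end = n;
        for (int i = start; i + m <= end; i++)   /* start stays in own block */
            if (strncmp(T + i, P, m) == 0)
                printf("process %d: match at %d\n", r, i + 1);
    }
}

int main(void) {
    partitioned_match("bccbcbaacbbcabb", "cba", 3);  /* the example above */
    return 0;
}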
The pseudocode for this algorithm is shown in Algorithm 5.6. As in the naive
exact algorithm, it has a time complexity of O(nm).
S = a b b a b c b a a b c c a
A = b a b c
B = a - c b a - - c
C = b a b - b
The following example shows the LCS P of two strings A and B. In general, we
are more interested in finding the length of LCS rather than its contents.
A = a b b a b c b a b a
B = b b a c c c a a c
P = b b a c a a
1. Case 1: Considering any two elements A[i] and B[j], the first case is that they are
equal. In this case (A[i] = B[j]), the length of the current LCS is incremented.
This relation can be stated recursively as follows:
LCS[i, j] = 1 + LCS[i − 1, j − 1] (5.2)
2. Case 2: When A[i] ≠ B[j], either A[i] or B[j] has to be discarded, therefore:
LCS[i, j] = max(LCS[i − 1, j], LCS[i, j − 1]) (5.3)
Using these two cases, we can form the dynamic programming solution to this
problem using two nested loops as shown in Algorithm 5.7. The array C holds the
partial and final solutions to this problem.
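The recurrences (5.2) and (5.3) translate directly into C; a sketch with illustrative names, assuming string lengths under 64, is:

#include <stdio.h>
#include <string.h>

/* Length of the LCS of A and B by dynamic programming. */
int lcs_length(const char *A, const char *B) {
    int n = strlen(A), m = strlen(B);
    static int C[64][64];
    for (int i = 0; i <= n; i++) C[i][0] = 0;   /* empty prefix of B */
    for (int j = 0; j <= m; j++) C[0][j] = 0;   /* empty prefix of A */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            if (A[i - 1] == B[j - 1])
                C[i][j] = C[i - 1][j - 1] + 1;          /* Eq. (5.2) */
            else
                C[i][j] = C[i - 1][j] > C[i][j - 1]
                        ? C[i - 1][j] : C[i][j - 1];    /* Eq. (5.3) */
    return C[n][m];
}

int main(void) {
    printf("%d\n", lcs_length("abacc", "babcc"));  /* 4; the example that follows */
    return 0;
}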
For two strings A = {abacc} and B = {babcc}, let us form the output matrix C as
below. The time to fill this matrix is Θ(nm) with a total space Θ(nm), and the final
length of LCS is in the last element of C, that is, C[n, m] = |LCS(A, B)|. In order
to find the actual LCS sequence, we start from the lower right corner of the matrix,
which holds the element C[n, m], and backtrack the path we followed
while filling the matrix. The sequence elements discovered in this way are shown
by arrows in the matrix and the final LCSes are {bacc} and {abcc} as there are two
paths.
A fast bit-vector algorithm for the LCS problem was proposed in [6] with O(nm/w) time and O(m/w) space complexities where w is the
size of the machine word and n and m are the lengths of the two strings.
Dominant points are the minimal points of search and using them narrows the
search space efficiently. Studies using dominant points to solve the multiple LCS problem
in parallel for more than two input strings have been reported in [7,8]. A parallel
algorithm to solve the LCS problem on graphics processing units (GPUs) was proposed
by Yang et al. [9].
As another example, let us consider the longest increasing subsequence (LIS) prob-
lem. Given a sequence S = {a1, a2, . . . , an} of numbers, L is the longest subsequence
of S where ∀ai, aj ∈ L, ai ≤ aj if i ≤ j. For example, given S = {7, 2, 4, 1, 6, 8, 5, 9},
L = {2, 4, 6, 8, 9}. A dynamic programming solution to the problem would again
start from the subproblems and store the results to be used in future. Algorithm 5.8
shows how to find the LIS using dynamic programming in O(n2) time [10]. The procedure
Max_Val is used to find the maximum of the numbers in an array in order to find
the length of the longest path stored in the array L.
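A sketch of this O(n²) dynamic programming in C follows; L[i] holds the length of the longest increasing subsequence ending at position i, and taking the maximum of L corresponds to the Max_Val step (names illustrative):

#include <stdio.h>

/* O(n^2) LIS: L[i] = 1 + max L[j] over j < i with a[j] <= a[i]. */
int lis_length(const int *a, int n) {
    int L[64], best = 0;
    for (int i = 0; i < n; i++) {
        L[i] = 1;                            /* a[i] alone is length 1 */
        for (int j = 0; j < i; j++)
            if (a[j] <= a[i] && L[j] + 1 > L[i])
                L[i] = L[j] + 1;
        if (L[i] > best) best = L[i];        /* the Max_Val step       */
    }
    return best;
}

int main(void) {
    int S[] = {7, 2, 4, 1, 6, 8, 5, 9};
    printf("%d\n", lis_length(S, 8));        /* 5, for L = {2,4,6,8,9} */
    return 0;
}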
The longest common increasing subsequence (LCIS) problem is more general than LIS in
that we search for an increasing subsequence common to two or more sequences. Formally, given two strings
A = (a1, . . . , am) and B = (b1, . . . , bn) over an ordered alphabet Σ = {x1 < x2 < · · · },
the LCIS of A and B is the increasing common subsequence of A and B of the longest
length. An example LCIS P of two strings A and B is shown below:
A = 3 6 1 4 2 5 7 8 2
B = 2 3 8 4 1 9 5 1 8
P = 3 4 5 8
A suffix tree, introduced by Weiner [11], is a data structure that represents the suffixes
of a string. Suffix trees have numerous applications in bioinformatics problems
including exact string matching and pattern matching. Formally, a suffix tree can be
defined as follows:
This representation of a suffix tree, however, does not guarantee a valid suffix tree
for every string S. If a suffix of S is also a prefix of another suffix, the path for
that suffix will not end at a leaf. It is therefore general practice to place a
special symbol such as $ at the end of S, so that every suffix of S ends with $, as
shown in Fig. 5.3 which represents a suffix tree for the string S = abbcbabc.
A generalized suffix tree contains suffix trees for two or more strings. In this
case, each leaf number of such a tree is represented by two integers that identify the
string and the starting position of the suffix represented by the leaf in that string as
shown in Fig. 5.4 where two strings S1 = acbab and S2 = bacb over an alphabet
Σ = {a, b, c} are represented in the same tree.
There are various algorithms to construct a suffix tree T of a string S. Weiner was the first
to provide a linear-time algorithm to construct a suffix tree [11], which was refined
by McCreight into a more space-economical algorithm [14]. Ukkonen proposed an
online algorithm which provides further space savings over McCreight's algorithm
[17], and Farach provided an algorithm to construct suffix trees for unbounded
alphabet sizes [16]. We will describe sequential and parallel algorithms to construct
suffix trees starting with a naive algorithm and then the algorithm due to Ukkonen.
We will then investigate methods for parallel construction of suffix trees.
The naive algorithm inserts the suffixes of S into the tree one by one, starting with the
longest suffix S[1, n]$, so that T1 consists of this suffix only. At each step i + 1, the tree Ti is traversed starting
at the root r and a prefix of S[i + 1, n]$ that matches a path of Ti starting from the root
node is searched. If such a prefix is not found, a new leaf numbered i + 1 with edge
label S[i + 1, n]$ is formed as an edge coming out of the root. If such a prefix of
S[i + 1, n]$ exists as a prefix of a branch of Ti, we check the symbols of the path in that
branch until a mismatch occurs. If this mismatch is in the midst of an edge (u, v),
we split (u, v) to form a new branch, otherwise if the mismatch occurs after a vertex
w, the new branch is joined to w as shown in Algorithm 5.9.
The execution of Algorithm 5.9 is shown in Fig. 5.5 for the string S = babcab.
We start with the longest suffix S[1..n]$ in (a) and connect it to the root r to form T1 .
The second suffix starting at location 2 does not have a matching path and therefore
a new branch to T1 is added to form T2 as shown in (b). The suffix S[3..n] has the
first character in common with the first suffix and therefore, we can split the first
suffix branch as shown in (c). Continuing in this manner, the whole suffix tree is
constructed.
The naive algorithm has a time complexity of O(n2 ) for a string S of length n
and requires O(n) space. The time needed for the ith suffix in the ith iteration of this
algorithm is O(n − i + 1). Therefore, the total time is:
$\sum_{i=1}^{n} O(n - i + 1) = \sum_{i=1}^{n} O(i) = O(n^2)$    (5.4)
Fig. 5.5 Construction of a suffix tree for the string S = {babca} by the naive algorithm
Definition 5.3 [12] An implicit suffix tree Tim of a string S is obtained from the
suffix tree T of S by deleting all terminal $ symbols from the edges of T, then deleting
all unlabeled edges and any nodes that have only one child in T.
Ukkonen’s algorithm basically constructs implicit suffix trees and extends them
and it finally converts the implicit trees to explicit trees. This process is illustrated
in Fig. 5.6 in which a suffix tree is converted to an implicit suffix tree. We will now
briefly review Ukkonen’s algorithm at high level as described in [12]. The algorithm
has n phases and a tree Ii+1 is constructed from the tree Ii of the previous phase. In
the phase i + 1, there are i + 1 extensions, and at extension j of phase i + 1, the suffix
S[j, i + 1] is placed in the tree as shown in Algorithm 5.10.
Fig. 5.6 a A suffix tree for the string S = cbcbabcb, b The implicit suffix tree of S
In the first case, the path for the suffix ends at a leaf and we append S[i + 1] to the
label of that leaf edge. Otherwise, if the path ends at an internal node v, we form a new edge labeled S[i + 1]
from v and attach a leaf labeled j to this edge. In case 3, the path ends neither at a leaf nor at an internal
node but inside an edge label; in this case, we form a new node v at the end of
the path, and insert from v a new edge labeled S[i + 1] with a leaf labeled j. In
this naive form, the complexity of the algorithm is O(n3), which is unacceptable for
long biological sequences; a few tricks, namely the use of suffix links and edge-label
compression, are used to reduce the time complexity to O(n).
Earlier parallel suffix tree construction methods such as [18] were mainly
theoretical in nature. Two relatively more recently reported parallel disk-based
(out-of-core) suffix tree construction algorithms are the wavefront method [23] and
Elastic range (ERa) [19]. The wavefront method consists of a number of main steps
to construct a suffix tree in parallel. In the initial network string caching step, a
cache that consists of all main memories in the system is built. The following task
generation step involves finding a set of variable length prefixes P of the input string
S. The location of each prefix p ∈ P is discovered in the third step called prefix
location discovery. The subtrees are then constructed for each prefix p and finally are
combined to yield the final suffix tree. The complexity of this algorithm is reported
as O(n2/k) where k is the number of processors. The wavefront method was
evaluated on a distributed-memory IBM BlueGene/L supercomputer with 1024
PowerPC 440 processors, using MPI for message passing, and was found to be
scalable.
ERa takes a slightly different approach by employing vertical and horizontal
partitioning. The suffix tree is divided into a number of subtrees using variable length
prefixes in the vertical partitioning. Moreover, the subtrees are grouped to share
the input/output costs. Each subtree is further horizontally divided considering the
currently processed number of paths in the subtree. The subtrees are then combined
to yield the final suffix tree. The time complexity of this method is shown to be
O(n2) where n is the length of the input string. Serial, shared-memory/shared-disk,
and shared-nothing parallel versions of this algorithm were presented.
In the distributed construction, a 16-node Linux cluster was used as the testbed and
significant speedups were reported when compared with the wavefront method [19].
A recent parallel suffix tree construction algorithm called parallel continuous flow
is presented in [24] where a suffix tree is built by using the suffix array and longest
common prefix structure.
There are numerous applications of suffix trees in string processing and biological
sequence analysis. Two such applications, exact string matching and longest common
substring discovery, are relevant here and we describe them next. We will see other suffix tree
applications in the following chapters, where they are used for sequence alignment and
for finding sequence repeats in DNA and proteins.
Every occurrence of a pattern P in a string S is a prefix of some suffix of S. Based
on this observation, we can search for the pattern P[1, m] in S[1, n] by first forming the
suffix tree TS of S. We can then check each suffix starting from the root. Let S[i, n]
be a suffix under consideration. If S[i] ≠ P[1] we can abort searching that branch
and continue with the next suffix. If P matches fully along a suffix branch, the
number of leaves below the point where P ends gives the number of occurrences of the
pattern in S. This procedure consists of the following steps:
1. Construct the suffix tree TS of S.
2. Starting from the root, follow the path whose edge labels match the consecutive
characters of P, aborting a branch when a mismatch occurs.
3. If all of P is matched, the leaves below the point where P ends give the locations
and the number of occurrences of P in S.
Fig. 5.8 Longest common substring discovery using a suffix tree. The generalized suffix tree T for
the strings S1 = ccbac and S2 = bccba is first constructed, and the internal nodes that have leaves of
both strings in their subtrees are marked (here, all of the internal nodes). The circled internal node v
is the deepest such node, and the prefix ccba on the path leading to v is the longest common substring of S1
and S2
In the original suffix array construction algorithm in [25], a method called prefix
doubling is used in which suffixes that have the same prefixes of length k are grouped
into buckets and the suffixes in a bucket are sorted using their first 2k characters. The
bucket numbers are updated and this process is repeated for a maximum of log n
rounds until all suffixes are in buckets of length 1 resulting in O(n log n) time [26].
A survey of suffix array construction algorithms can be found in [27].
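The prefix-doubling idea can be sketched compactly in C. This version is illustrative only: globals feed the qsort comparator, and because each of the O(log n) rounds re-sorts with comparisons, it runs in O(n log² n) rather than the O(n log n) bound above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXN 64
static int rank_[MAXN], tmp_[MAXN], k_, n_;

/* Compare suffixes a and b by (rank of first k chars, rank of next k). */
static int cmp(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    if (rank_[a] != rank_[b]) return rank_[a] - rank_[b];
    int ra = a + k_ < n_ ? rank_[a + k_] : -1;   /* -1: suffix ran out  */
    int rb = b + k_ < n_ ? rank_[b + k_] : -1;
    return ra - rb;
}

/* Build the suffix array sa of s by prefix doubling. */
void suffix_array(const char *s, int *sa) {
    n_ = strlen(s);
    for (int i = 0; i < n_; i++) { sa[i] = i; rank_[i] = s[i]; }
    for (k_ = 1; ; k_ *= 2) {
        qsort(sa, n_, sizeof(int), cmp);           /* sort by first 2k chars */
        tmp_[sa[0]] = 0;
        for (int i = 1; i < n_; i++)               /* renumber the buckets   */
            tmp_[sa[i]] = tmp_[sa[i - 1]] + (cmp(&sa[i - 1], &sa[i]) < 0);
        memcpy(rank_, tmp_, n_ * sizeof(int));
        if (rank_[sa[n_ - 1]] == n_ - 1) break;    /* all buckets of size 1  */
    }
}

int main(void) {
    const char *s = "banana";
    int sa[MAXN];
    suffix_array(s, sa);
    for (int i = 0; i < (int)strlen(s); i++)
        printf("%d: %s\n", sa[i], s + sa[i]);      /* suffixes in sorted order */
    return 0;
}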
We reviewed fundamental algorithms for strings starting with exact string matching
in this chapter. Approximate string matching refers to the case of matching when a
number of mismatches to the pattern are allowed. The first algorithm for exact string
matching called the naive algorithm checked all possible occurrences of the pattern in
the text at the expense of high time complexity of O(nm) which prohibits its use with
large strings such as biological sequences. We then described algorithms with bet-
ter time complexities. The Knuth–Morris–Pratt algorithm provides deterministic
worst-case bounds, but the Boyer–Moore algorithm is probably the most favorable algorithm
for string matching due to its low average complexity and because its performance
improves with increasing alphabet size. Both of these algorithms require preprocessing
of tables to be used during the pattern search. Table 5.3 compares the
performances of these algorithms.
The Rabin–Karp algorithm is a randomized algorithm that has linear time complexity
in many cases [32]. It uses hashing to compute a signature for each m-character
substring of the text T and then checks whether the signature of the pattern P is
equal to a signature of T. When the signatures match, the candidate substring is
compared with the pattern character by character to confirm an actual match. The
Aho–Corasick algorithm constructs a finite state machine from a set of input keywords
and uses it to find all occurrences of the keywords in the text [33].
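A sketch of the Rabin–Karp idea in C follows; the base and modulus are illustrative choices, and candidates whose signatures match are verified character by character:

#include <stdio.h>
#include <string.h>

#define B 256     /* base of the rolling hash                   */
#define Q 101     /* small prime modulus; real uses pick larger */

/* Report occurrences of P in T using rolling-hash signatures. */
void rabin_karp(const char *T, const char *P) {
    int n = strlen(T), m = strlen(P);
    long hP = 0, hT = 0, pow = 1;
    for (int i = 0; i < m - 1; i++) pow = (pow * B) % Q;   /* B^(m-1) mod Q */
    for (int i = 0; i < m; i++) {
        hP = (hP * B + P[i]) % Q;
        hT = (hT * B + T[i]) % Q;
    }
    for (int i = 0; i + m <= n; i++) {
        if (hP == hT && strncmp(T + i, P, m) == 0)  /* verify on hash hit */
            printf("match at %d\n", i + 1);
        if (i + m < n)                              /* roll the window    */
            hT = ((hT - T[i] * pow % Q + Q) * B + T[i + m]) % Q;
    }
}

int main(void) {
    rabin_karp("abccbabaccbaab", "ccba");
    return 0;
}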
The distributed implementations of these algorithms are straightforward: we
partition the input string across a number of processes, and each process then
searches for the pattern in its own partition. The supervisor, which is a designated process,
gathers all of the results obtained and combines them to produce the final output.
However, there are only a few studies on the parallelization of these algorithms on
distributed memory computing systems.
Suffix trees provide a convenient method of representing strings. Construction
of suffix trees can be performed by various algorithms including the naive method.
The space complexity of such an algorithm should also be considered as long strings
and their suffix trees will require significant memory. Hence, the algorithms can be
basically divided into in-core and out-of-core methods. We described two in-core
algorithms and the distributed algorithms to construct suffix trees are generally out-
of-core approaches which provide construction of subtrees in individual processors.
The string is typically partitioned into a number of segments and a subtree is formed
for each segment. The properties of the segments to be distributed vary among
algorithms; typically a segment pair or a set of prefixes is used. The main schemes
partition the data into a number of processes and the supervisor gathers and combines
the results as before. Suffix arrays provide a compact alternative to suffix trees in
memory: they store only the string and the indices of its lexicographically sorted
suffixes. The price paid, however, is the higher time complexity in various applications
such as pattern matching.
Exercises
1. Find the pattern ATTCT in the following DNA sequence using the naive algo-
rithm. Show all iterations of the algorithm.
A T G G C T A T T C T A T G G C T A G
References
1. Baeza-Yates R (1989) String searching algorithms revisited. In: Dehne F, Sack JR, Santoro N
(eds) Workshop in algorithms and data structures, Lecture notes on computer science, vol 382.
Springer, Ottawa, Canada, pp 75–96
2. Knuth DE, Morris JH, Pratt VR (1977) Fast pattern matching in strings. SIAM J Comput
6(2):323–350
3. Boyer RS, Moore JS (1977) A fast string searching algorithm. Commun ACM 20(10):761–772
4. Breslauer D, Galil Z (1993) A parallel implementation of Boyer-Moore string searching algo-
rithm. Sequences II:121–142
5. Ukiyama N, Imai H (1993) Parallel multiple alignments and their implementation on CM5. In:
Proceedings of genome informatics workshop, pp 103–108
6. Crochemore M, Iliopoulos CS, Pinzon YJ, Reid JF (2001) A fast and practical bit vector
algorithm for the longest common subsequence problem. Inform Process Lett 80(6):279–285
7. Korkin D, Wang Q, Shang Y (2008) An efficient parallel algorithm for the multiple longest
common subsequence (MLCS) problem. In: Proceedings 37th international conference on
parallel processing, pp 354–363
8. Liu W, Chen L (2006) A parallel algorithm for solving LCS of multiple biosequences. In:
Proceedings fifth international conference on machine learning and cybernetics, Dalian, China,
pp 4316–4321
9. Yang J, Xu Y, Shang Y (2010) An efficient parallel algorithm for longest common subsequence
problem on GPUs. In: Proceedings of world congress on engineering 2010, vol I, WCE 2010,
June 30-July 2, 2010, London, U.K
10. Erciyes K (2014) Complex networks: an algorithmic perspective. CRC Press Taylor and Francis,
pp 38–39, ISBN 9781466571662
11. Weiner P (1973) Linear pattern matching algorithms. In: Proceedings of 14th IEEE symposium
on switching and automata theory, pp 1–11
12. Gusfield D (1997) Algorithms on strings, trees and sequences. Computer science and compu-
tational biology. Cambridge University Press
13. Sung W-K (2009) Algorithms in Bioinformatics: a practical Introduction. Chapman &
Hall/CRC Mathematical and computational biology. Chap 3, November 24
14. McCreight E (1976) A space-economical suffix tree construction algorithm. J ACM 23(2):262–
272
15. Ukkonen E (1993) Approximate string-matching over suffix trees. Combinatorial pattern
matching, Springer LNCS vol 684, pp 228–242
16. Farach-Colton M, Ferragina P, Muthukrishnan S (2000) On the sorting-complexity of suffix
tree construction. J ACM 47(6):987–1011
17. Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
18. Apostolico A, Iliopoulos C, Landau G, Schieber B, Vishkin U (1988) Parallel construction of
a suffix tree with application. Algorithmica 3:347–365
19. Mansour E, Allam A, Skiadopoulos S, Kalnis P (2011) ERa: efficient serial and parallel suffix
tree construction for very long strings. Proc VLDB Endowment 5(1):49–60
20. Phoophakdee B, Zaki MJ (2007) Genome-scale disk-based suffix tree indexing. In: Proceedings
of ACM SIGMOD, pp 833–844
21. Barsky M, Stege U, Thomo A, Upton C (2009) Suffix trees for very large genomic sequences.
In: Proceedings of ACM CIKM, pp 1417–1420
22. Ghoting A, Makarychev K (2009) Serial and parallel methods for I/O efficient suffix tree
construction. In: Proceedings of ACM SIGMOD, pp 827–840
23. Ghoting A, Makarychev K (2009) Indexing genomic sequences on the IBM Blue Gene. In:
Proceedings of conference on high performance computing networking, storage and analysis
(SC), pp 1–11
24. Comin M, Farreras M (2014) Parallel continuous flow: a parallel suffix tree
construction tool for whole genomes. J Comput Biol 21(4):330–344
25. Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J
Comput 22(5):935–948
26. Rajasekaran S, Nicolae M (2014) An elegant algorithm for the construction of suffix arrays. J
Discrete Algorithms 27:21–28
27. Puglisi S, Smyth W, Turpin A (2007) A taxonomy of suffix array construction algorithms.
ACM Comput Surv 39(2)
28. Futamura N, Aluru S, Kurtz S (2001) Parallel suffix sorting. In: Proceedings of 9th international
conference on advanced computing and communications, pp 76–81
29. Dementiev R, Karkkainen J, Mehnert J, Sanders P (2008) Better external memory suffix array
construction. J Exp Algorithmics 12:3–4
30. Karkkainen J, Sanders P, Burkhardt S (2006) Linear work suffix array construction. J ACM
53(6):918–936
31. Kurtz S, Choudhuri J, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R (2001) REPuter: the
manifold applications of repeat analysis on a genome scale. Nucleic Acids Res 29(22):4633–
4642
32. Karp RM, Rabin MO (1987) Efficient randomized pattern-matching algorithms. IBM J Res
Dev 31(2):249–260
33. Aho AV, Corasick MJ (1975) Efficient string matching: an aid to bibliographic search. Comm
ACM 18(6):333–340
Sequence Alignment
6
6.1 Introduction
A motif is a repeating DNA nucleotide or protein amino acid sequence that has a
biological significance. Sequence comparison methods are used to discover such
motifs, as we will see in Chap. 8. In summary, the distances and similarities between
biological sequences are required by various sequence analysis methods, and the
alignment methods are usually the first step that provides the needed input to all of
these methods.
Two sequences are aligned in pairwise alignment and multiple sequences are
aligned in multiple alignment. In global alignment, two homologous sequences of
similar lengths are compared over their entire sequence. This method is used to
find similarities of two closely related sequences. In many cases, however, only
certain segments of two sequences may be similar but the rest of the sequences may
be completely unrelated. For example, two proteins may consist of a number of
domains and only one or two of these domains may be similar. The global alignment
will not display a high similarity between these two proteins in this case. Local
alignment refers to the method of finding similar regions in two sequences which
may have very different lengths. Multiple sequence alignment can be performed by
global alignment if the input sequences are closely related and we are searching
for the similarity of these sequences as a whole; or by local alignment when
we are interested in finding similar subsequences of otherwise unrelated sequences.
Different types of alignment methods need different algorithms; however, they can
be coarsely classified as dynamic programming based, heuristic or a combination of
both in general.
In this chapter, we first state the sequence alignment problem, describe ways
of evaluating goodness of any alignment method and then analyze representative
sequential global, local and multiple sequence alignment algorithms in detail. We
then provide parallel/distributed algorithms aimed to solve these problems and review
current research in this area.
Sequence alignment is the most fundamental method of comparing two
biological sequences. In the most common application of such alignment, we have an
input sequence called the query that needs to be identified since it is newly discovered
or not aligned before; and this query is typically aligned with each sequence in a
database of sequences. The sequences in the database that have the highest scores are
then identified as the ones having highest similarity and therefore relatedness to the
query sequence. This affinity in base structures may imply phylogenetic relationships
and also similar functionality to aid the analysis of the newly discovered sequence.
We need to assess the quality of an alignment, which reflects its goodness. The cost-benefit
approach assigns a score to each of three events during alignment: a match, a
mismatch, and an insertion or deletion (indel) of a gap.
Insertion and deletion of gaps refer to the operations on the first sequence, that
is, insertion/deletion means inserting/deleting a gap to/from the first sequence. An
alignment between two DNA sequences X and Y is shown below with matches,
mismatches, insertions, and deletions.
A positive score is associated with a match and negative scores are used to penal-
ize a mismatch and an indel. The negative scores or penalties are based on observed
statistical occurrences of indels and mismatches; typically, the indels are penalized
more, reflecting their lower prevalence in genome alignments. As
an example, let us use the scores +2 for a match, –1 for a mismatch and –2 for an
indel. Given the two DNA sequences X = ATGGCTACAC and Y = GTGTACTAC,
we can have various alignments four of which are shown with mismatches (m) and
indels (i) marked. Among these four options, the alignment in (a) or (b) should be
chosen as they both have the highest scores.
The aim of any alignment method is to maximize the total score. However, there
is an exponential number of possible alignments to check, so if we can find an alignment that
has a higher score than the others, using it should be preferred. Formally, alignment
of two sequences can be defined as follows:
Definition 6.1 (sequence alignment) Let Σorg be an alphabet and X = x1 . . . xn
and Y = y1 . . . ym be two sequences over this alphabet, and let Σ = Σorg ∪ {−},
that is, the space character added to the original alphabet. An alignment of these two
sequences is a two-row matrix where the first row contains the elements of X, the
second row contains the elements of Y, and each column contains at least one element of Σorg.
A related parameter between two sequences is the edit distance between them
which is defined as follows:
Definition 6.2 (edit distance) Edit distance, or the Levenshtein distance, is the min-
imum number of substitutions, insertions, and deletions between two sequences.
Hamming distance is an upper bound on edit distance.
For the above example, the edit distance between the two sequences is 4 which
occurs in (a). The procedure for sequence alignment is very similar to finding LCS
between them; however, we now have costs associated with matches, mismatches,
and indels and search for the highest scoring alignment.
Proteins consist of a sequence of amino acids from a 20-letter alphabet. Some
substitutions are much more likely to occur in proteins, and are more frequently encountered,
than others. This fact necessitates the use of a weighting scheme for each
amino acid substitution. Scoring matrices for protein amino acid sequences
define the score for each possible substitution. The two widely
quences define the scores for each substitution in these sequences. The two widely
used matrices for this purpose are the point accepted mutation (PAM) matrix [11] and
the blocks substitution matrix (BLOSUM) [17]. They both use statistical methods
and are based on counting the observed substitution frequency and comparison of
this value with the expected substitution frequency.
A positive score in the entry mij of a PAM matrix M means that the probability of
the substitution between i and j is more than its expected value; therefore, it bears
some significance. The entry mij is formed by considering the expected frequencies
of i and j, and the frequency of alignment between i and j in the global alignment
of homologous sequences [4]. The nth power of M is then taken to form the PAM-n
matrix such as PAM-80, PAM-120, or PAM-250. A large value of n should be used
to align proteins that are not closely related.
PAM may not provide realistic values for remotely related protein sequences as it
relies on extrapolation of values. The BLOSUM matrix structure proposed by Henikoff and
Henikoff [17] overcomes this difficulty by analyzing segments of proteins rather than
whole sequences. If two protein segments under consideration have similarity over a
threshold value, they are clustered. The threshold value t is specified as BLOSUM-t,
which means the matrix is generated by combining sequences that have at least t %
similarity. The BLOSUM62 matrix, which is formed by clustering sequences
that have at least 62 % identity, is shown below. A small value of t is used
for distantly related protein sequences, and more closely related ones can be aligned
using a larger value.
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
Let us form the alignment array M for two given DNA sequences X = ACCGT
and Y = AGCCTC with n = 5 and m = 6; and we select α = 2, β = −1 and
γ = −1 for simplicity. We first fill the first row and column of M with gap penalties
and assign 0 value for the first entry. We then form each entry using the dynamic
programming relation in the NW algorithm to obtain the final array as shown below.
A G C C T C
0 -1 -2 -3 -4 -5 -6
A -1 2 ← 1 0 -1 -2 -3
C -2 1 1 3 2 1 0
C -3 0 0 3 5 4 3
↑
G -4 -1 2 2 4 4 3
T -5 -2 1 1 3 6 ← 5
The global alignment problem can now be reduced to finding the best scoring
path between the vertices M[0, 0] and M[n, m]. We can now start from the lower right
corner of M and work our way up to M[0, 0] by following, in reverse, the
path chosen while filling the array. An up arrow
means a gap in the top sequence; a left arrow represents a gap in the sequence
on the left-hand side; and a diagonal arrow shows a match or a mismatch without a
gap in that position. Alternative paths result in alternative alignments with the
same score. Implementing this procedure for the above example yields the following
alignment with a score of 5 (4 matches and 3 indels).
A G C C - T C
A - C C G T -
The time to fill the array is O(nm), which is the size of the array, and the
space requirement is the same. At the end of the algorithm, the best alignment score is
stored in M[n, m].
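A sketch of the fill phase in C, using the scores α = 2, β = −1, and γ = −1 from the example above, is given below (traceback omitted; names illustrative and lengths assumed under 64):

#include <stdio.h>
#include <string.h>

#define ALPHA  2    /* match    */
#define BETA  -1    /* mismatch */
#define GAMMA -1    /* indel    */

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Global alignment score of X (length n) and Y (length m). */
int needleman_wunsch(const char *X, const char *Y) {
    int n = strlen(X), m = strlen(Y);
    static int M[64][64];
    for (int i = 0; i <= n; i++) M[i][0] = i * GAMMA;  /* leading gaps */
    for (int j = 0; j <= m; j++) M[0][j] = j * GAMMA;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (X[i - 1] == Y[j - 1]) ? ALPHA : BETA;
            M[i][j] = max3(M[i - 1][j - 1] + s,   /* match/mismatch */
                           M[i - 1][j] + GAMMA,   /* gap in Y       */
                           M[i][j - 1] + GAMMA);  /* gap in X       */
        }
    return M[n][m];
}

int main(void) {
    printf("%d\n", needleman_wunsch("ACCGT", "AGCCTC"));  /* 5, as above */
    return 0;
}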
Global alignment may not provide the correct results in the presence of genome
shuffling and rearrangements. An inversion of a segment may occur in a
sequence, causing a subsequence to look radically different. Local alignment
provides information about the conserved subsequences within organisms. A local
alignment between two sequences, with four matches and a mismatch, is shown in
bold below.
A T G C T A G T G C C
G C A C T T G T A A T
The main differences between the SW and NW algorithms are as follows. First, a fourth value of zero is allowed in addition to the
three possible values in the NW algorithm, to prevent negative values in the alignment
graph. The first row and column of the array M now contain zeros to discard gaps
occurring at the beginning of the sequences. We still attempt to find a path with the maximum
value, but to allow for local alignments we do not have to start from the beginning.
Instead, we start with the maximum value in the array and stop when a zero is
encountered, which signals the end of the local alignment. Algorithm 6.2 shows
the pseudocode of this algorithm.
G G A T A C G T A
T C A T A C - T
G G A T A C G T A
0 0 0 0 0 0 0 0 0
T 0 0 0 2 ← 1 0 0 2 1
C 0 0 0 1 1 3 2 1 1
A 0 0 2 1 3 2 2 1 3
T 0 0 1 4 3 2 1 4 3
A 0 0 2 3 6 5 4 3 6
C 0 0 1 2 5 8 ← 7 6 5
T 0 0 0 3 4 7 6 9 8
If we start with the 6 shown on the upper path, a local alignment with one mismatch
and one gap, and a score of 6, is obtained as follows.
G G A T A C G T A
T - C A T A C T
We can obtain various local alignments between these two sequences using alternative
paths, starting with the next largest value and continuing until a 0 is reached. The time
and space complexities of this algorithm are O(nm) as in the NW algorithm, since we
need to fill the alignment matrix as before.
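The SW fill phase differs from the NW sketch above only in the fourth zero option and in tracking the maximum cell; a standalone sketch with the same illustrative scores is:

#include <string.h>

#define ALPHA  2    /* match    */
#define BETA  -1    /* mismatch */
#define GAMMA -1    /* indel    */

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Best local alignment score of X and Y: same recurrence as NW, but each
   cell is clamped at zero and the answer is the maximum over all cells. */
int smith_waterman(const char *X, const char *Y) {
    int n = strlen(X), m = strlen(Y), best = 0;
    static int M[64][64];
    for (int i = 0; i <= n; i++) M[i][0] = 0;   /* no penalty for leading gaps */
    for (int j = 0; j <= m; j++) M[0][j] = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (X[i - 1] == Y[j - 1]) ? ALPHA : BETA;
            int v = max3(M[i - 1][j - 1] + s,
                         M[i - 1][j] + GAMMA,
                         M[i][j - 1] + GAMMA);
            M[i][j] = v > 0 ? v : 0;            /* the fourth, zero option */
            if (M[i][j] > best) best = M[i][j]; /* traceback starts here   */
        }
    return best;                                /* 9 for the example above */
}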
The center star method is an approximation algorithm with a ratio of 2. The main
idea of this algorithm is to identify a sequence which is closest to all others as the
center and then work out the alignments of all sequences with respect to this center.
There are various methods to measure the distances between the sequences. As a
simple approach, we can find the consensus sequence Scs of the n input sequences
S = {S1, . . . , Sn}, which is the sequence containing the most frequent symbol of each
column. We can then work out the distance of each sequence Si to Scs and mark
the sequence with the shortest distance to Scs as the central
sequence Sc . Alternatively, we can compute pairwise distances between all pairs of
sequences which is called the sum of pairs distance which is used in the center star
method. The center star algorithm specifically consists of the following steps:
1. Input: A set S = {S1 , . . . , Sk } of k sequences of length n each.
2. Output: MSA of sequences in S .
3. Work out the distance matrix D between sequences such that di j entry of D is
equal to the distance between Si and S j .
4. Find the center sequence Sc, that is, the sequence Sj that minimizes the sum of pairs distance $\sum_{i=1}^{k} d_{ij}$.
5. For each Si ∈ S \ Sc , find an optimal global alignment between Si and Sc using
Needleman–Wunsch algorithm.
6. Insert gaps in Sc to complete MSA.
The center sequence is the one that is most similar to all other sequences. For
example, given four sequences S1, . . . , S4, we can find their pairwise distances as
shown.
S1 S2 S3 S4
S1 0 3 3 5
S2 3 0 2 4
S3 3 2 0 1
S4 5 4 1 0
The total number of comparisons is k(k − 1)/2 times, 6 in this case, resulting in a
total time complexity of O(k 2 n 2 ) for this step. We can then form the distance matrix
D with these values and find the sequence that has the greatest similarity to all others
as shown in Fig. 6.2. This step involves summing rows of the matrix D and detecting
the sequence with the lowest sum, which is S3 in this case, in O(k 2 n 2 ) time.
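The center-selection step is a simple row-sum over D; a sketch in C, with the column dimension fixed to 4 to match the example (a general version would flatten D), is:

#include <stdio.h>

/* Index of the center sequence: the row of the k x k distance matrix D
   with the minimum sum of distances to all other sequences. */
int center_sequence(int k, int D[][4]) {
    int best = 0, best_sum = 1 << 30;
    for (int i = 0; i < k; i++) {
        int sum = 0;
        for (int j = 0; j < k; j++)
            sum += D[i][j];                     /* row sum of D */
        if (sum < best_sum) { best_sum = sum; best = i; }
    }
    return best;
}

int main(void) {
    int D[4][4] = { {0,3,3,5}, {3,0,2,4}, {3,2,0,1}, {5,4,1,0} };  /* matrix above */
    printf("center: S%d\n", center_sequence(4, D) + 1);            /* prints S3    */
    return 0;
}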
We now need to align the sequences S1, S2, and S4 to the central sequence S3 by the
Needleman–Wunsch algorithm in O(kn2) time. Finally, gaps are inserted in the
aligned sequences to complete the multiple sequence alignment O(k 2 n) resulting
in a total time of O(k 2 n 2 ) since the first step dominates. It can be shown using
the triangle inequality between three sequences that the approximation ratio of this
algorithm is 2 [33].
1. Computation of all pairwise alignment scores and forming the distance matrix D
based on these scores.
2. Construction of a phylogenetic tree T using D by the neighbor-joining (NJ)
method (see Sect. 14.3).
3. Perform MSA with the sequences starting with the closely related ones in the tree.
The third step performs progressive pairwise sequence alignment, starting with the closest sequences in the tree. Figure 6.3 dis-
plays an example phylogenetic tree where the input sequences are the leaves of this
tree.
The NJ algorithm in general will produce unrooted trees where the input sequences
may not be equidistant from their ancestors. The root of this tree is placed at a location
from which the average distances on both of its sides are equal. The CLUSTALW
algorithm will use this tree to pairwise align the closest sequences as guided by the
tree. Starting from the leaves, the closest leaves are aligned iteratively to form larger
clusters at each step. Progressive alignment methods introduce significant errors
when sequences are distantly related. Also, the guide tree is formed using pairwise
alignments which may not reflect the evolutionary process accurately.
Suffix trees can be used for global sequence alignment. A fundamental method used
for this purpose is called anchoring in which similar regions called anchors in two
sequences are first identified using suffix trees. The segments between the anchors
are then aligned using dynamic programming or using the same method recursively
or a combination of both approaches.
MUMmer is one such algorithm that uses suffix trees for extracting maximal
unique matches (MUMs) that are used for anchoring [12]. Given two sequences X
and Y of lengths n and m, a MUM is a subsequence of both X and Y of length
greater than a given threshold d. A MUM of X and Y has to be unique in both of
them. A brute force algorithm needs to search all possible prefixes of both strings
in O(nm) time. However, this problem can be simplified by the aid of a generalized
suffix tree. We need to build a generalized suffix tree for the two strings and search
for internal nodes that have exactly two leaves, one from each sequence. We then
check whether the node representing the substring is maximal. If this condition is
satisfied, the prefix starting from the root and ending at the internal node represents
a MUM. This algorithm takes O(n + m) time, which covers the construction of the
generalized suffix tree as well as the remaining steps.
The order of MUMs is also conserved between related genomes, and therefore
we can expect the conserved regions in two biological sequences to contain ordered
rather than randomly distributed MUMs. The MUMmer algorithm builds on this
observation and attempts to find these conserved regions by computing the longest
common subsequence (LCS) of the MUMs [12]. The LCS problem
can be solved by the dynamic programming algorithm we reviewed in Sect. 5.4 in
O(n 2 ) time and O(n 2 ) space or by using generalized suffix trees. However, since
each MUM is unique, it can be replaced by a special character allowing a solu-
tion in O(n log n) time [33]. The regions between the anchors are aligned using
the Needleman–Wunsch algorithm. Multiple genome aligner (MGA) is a tool for
multiple sequence alignment based on suffix trees [18]. The longest nonoverlapping
sequence of maximal multiple exact matches (multiMEMs) are computed and then
used to guide the multiple alignments in this algorithm. LAGAN [5] is another
tool based on anchoring; however, it uses the CHAOS local alignment algorithm,
taking the local alignments produced by CHAOS as anchors and limiting the search area
of the Needleman–Wunsch algorithm to the regions around these anchors [6]. LAGAN
also provides a visual display of the alignment results.
6.6.1 FASTA
FASTA is an early local pairwise sequence alignment tool for database comparison
of biological sequences [27]. Its predecessor, FASTP [21], handled
protein sequence alignment only; since FASTA can search for both protein and
DNA sequences, it was named FASTA (fast-all). It is a heuristic algorithm that
compares a given input query sequence against the sequences in a database. Its
operation can be summarized as follows:
1. Given a query sequence Q and a set of sequences S = S1 , . . . , Sn in the database,
it searches for exact matches of length l between the query and a database sequence
Si ∈ S. These matches are called hotspots. Commonly used values of
l are 2 for protein amino acid sequences and 4–6 for DNA sequence
comparisons.
2. The hotspots are combined into longer stretches called initial regions. These
regions are scored using the similarity matrix M; only a small part of M is
examined, and the ten best-scoring alignments are considered for the next step.
3. Using dynamic programming, the ten best partial alignments are combined into
a longer alignment.
4. The SW algorithm is used to produce the final alignment of these sequences.
The main idea of this program is to find subsequence matches between the query
sequence and each of the sequences in the database, enlarge them, and compute local
alignments in these regions using dynamic programming. There have been a few efforts on
parallelizing FASTA, such as [19,30], on clusters of workstations.
6.6.2 BLAST
Basic local alignment search tool (BLAST), developed at the National Center for
Biotechnology Information by Altschul and colleagues [1], is a popular tool for local
sequence alignment. BLAST and its derivative algorithms are among the most widely
used tools for sequence alignment. The main idea of BLAST is to search only a
subspace of the sequences. In its basic version, gaps are not allowed during alignment
which simplifies the alignment procedure greatly. The assumption here is if there is
a similarity between two sequences, it will show even if the gaps are not allowed.
A segment pair in BLAST is defined as a pair of equal-length subsequences be-
tween two sequences S1 and S2 which are aligned without gaps. A maximal segment
pair (MSP) of S1 and S2 is the highest scoring segment between them. As the first
step, BLAST searches all subsequences of length l in the database that have an MSP
score higher than a threshold τ with the input query Q [9]. It searches the short
sequences first and then extends them. The subsequences found are called hits, which
are then extended in both directions to check whether the score remains higher than τ. In detail,
BLAST performs the following steps:
1. We are again given a query sequence Q and a set of sequences S = S1 , . . . , Sn in
the database. BLAST searches hits of length l that have an MSP of score higher
than τ between the query and the database sequence Si ∈ S . Typical values of l
are 3 for protein amino acid sequences and 11 for DNA sequences. The threshold
τ is dependent on the scoring matrix used.
2. It searches for pairs of hits which have a maximum distance of d between them.
3. The hit pairs are extended in both directions and the alignment score is checked
at each extension. This process is stopped when the score does not change. The
pair of hits scoring above a threshold after the extension are called high scoring
pairs (HSPs).
4. The consistent HSPs are combined into local alignment that gives the highest
score.
The newer versions of BLAST allow gaps [2,26]. The BLAST algorithm also
provides an estimate of the statistical significance of the output. This tool is available
for free usage at www.ncbi.nlm.nih.gov/blast/.
We have reviewed basic global and local exact alignment algorithms and the com-
monly used database alignment tools which use heuristic methods that provide ap-
proximate results. The database tools are simpler to parallelize on a distributed mem-
ory computer system as we can easily partition the database across the machines or
duplicate it if its size is not very large. We will first look at ways of parallelizing the
exact algorithms and then review existing methods for distributed alignment using
the database tools.
Fig. 6.5 Distributed BLAST using replicated database. The input query batch is Q 1 , . . . , Q n . Each
worker Wi has a copy of the database and receives a portion of the input query batch from the
supervisor process (SUP). Each worker then runs part of the BLAST query in its local database,
obtains the results (A p , . . . , Aq ) and sends them to the supervisor which combines the partial results
and outputs them
batch and broadcasts the batch to all workers. Upon a worker Wi announcing it is idle
and can start working, the supervisor assigns Wi a database segment. The worker Wi
then performs alignment in the segment it is allocated and reports the result to the
supervisor process. The operation of the algorithm is very similar to that depicted
in Fig. 6.6, with additional load balancing: a worker that has
finished searching a database segment can be assigned another search in a different
segment by the supervisor. The authors report that mpiBLAST achieves
super-linear speedup in all tests [10].
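The supervisor–worker pattern underlying such distributed search can be sketched with MPI as below. This is only an illustration of the message flow, with a dummy search routine and hypothetical segment counts and tags, not the actual mpiBLAST code:

#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1           /* message carries a segment index      */
#define TAG_STOP 2           /* no more segments: worker may exit    */
#define NSEG    16           /* hypothetical number of db segments   */

/* Placeholder for searching one database segment with the query batch. */
static int search_segment(int seg) { return seg; }

int main(int argc, char **argv) {
    int rank, size, next = 0, msg, active;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {                     /* supervisor */
        active = size - 1;
        for (int w = 1; w < size; w++) { /* one initial segment per worker */
            if (next < NSEG) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
        while (active > 0) {             /* collect results, hand out work */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            printf("result %d from worker %d\n", msg, st.MPI_SOURCE);
            if (next < NSEG) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                             /* worker */
        for (;;) {
            MPI_Recv(&msg, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            int res = search_segment(msg);
            MPI_Send(&res, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}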
The multithreaded versions of BLAST are also available to run on shared memory
multiprocessor systems. This mode of operation is similar to partitioned database
approach; however, the database is loaded to shared memory now and each thread
can work in its partition. Thread and shared memory management may incur over-
heads and cause scalability issues. NCBI BLAST [25] and WU BLAST [34] are the
examples of multithreaded BLAST systems [35]. UMD-BLAST is an interface
that selects the most suitable parallel/distributed BLAST algorithm: it inputs
the database size, query batch size, and query length and determines which algorithm
to use. For large databases which cannot be accommodated in the memory
of a single computer, UMD-BLAST uses mpiBLAST; for long query batches with
moderate query lengths, BLAST++, which employs a replicated database, is used.
Otherwise, the multithreaded BLAST is employed and the outputs are combined [35].
Fig. 6.6 The operation of distributed BLAST using partitioned database. Each worker Wi now has
a segment of the database and receives the full input query batch from the supervisor process (SUP).
It then runs all of the BLAST query in its database partition, finds the results, and sends them to
the supervisor which combines the partial results and outputs them
Let us review the three main steps of the CLUSTALW algorithm [32]. In the first
step, it computes pairwise distances; in the second step, the guide tree is constructed
using the neighbor-joining algorithm. This tree then guides the alignment in the last
step, where the leaves are aligned first, followed by the alignment of the close nodes
in the tree.
Parallelization, whether on shared memory or distributed memory computers,
involves implementing these three steps in parallel. The first step requires calculating
the distances between k sequences with k(k − 1)/2 comparisons. This step is trivial
to parallelize using the supervisor–worker model of parallel computation: the
supervisor sends groups of sequence pairs to the worker processes, which compute
the distances, and the results are then gathered at the supervisor. The parallel
implementation of the second and third steps is not so straightforward due to the data
dependencies involved.
ClustalW-MPI [20] is a distributed implementation of the CLUSTALW method
using the MPI parallel programming environment, based on the approach described
above. The distance matrix is first formed by allocating chunks of independent tasks
to processes. Large batches result in decreased interprocess communication times
but may have poor load balancing. On the other hand, small size of batches provides
balanced process loads at the expense of increased communication overheads.
The guide tree is formed by the neighbor-joining method as in CLUSTALW;
however, a few modifications to the original algorithm reduced the time to construct
the guide tree to O(n2). As this algorithm searches for sequences
that are close to each other but have the highest distance to all other clusters, a
parallel search method was designed to find such sequence clusters; however,
the details of this method are not described in the paper. In the final progressive alignment
step, a mixture of fine- and coarse-grained parallelism is used. Coarse-grain
parallelism involves aligning the external nodes of the guide tree, and the speedup
obtained is reported as n/ log n where n is the number of nodes of the tree. The authors
also implemented recursive parallelism and calculated the forward and backward
steps of the dynamic programming in parallel. They experimentally showed a
speedup of 15.8 on 16 processors when aligning 500-sequence test data.
Another parallel version of CLUSTALW, called pCLUSTAL, which can run on
hardware ranging from parallel multiprocessors to distributed memory parallel
computers using MPI, was described in [8]. This study also uses the supervisor–worker
paradigm, in which the supervisor process p0 maps the sequence pairs to processes and
each process then runs the sequential CLUSTALW algorithm on its own data set.
The results are then gathered at p0 which builds the guided tree T . It then examines
T for independently executable alignments and assigns these to processes. The final
step involves gathering all of the alignment results at p0. The experiments were
carried out on protein sequences with an average length of 300 amino acids. They showed
that the time-consuming pairwise alignment step takes time proportional to 1/k where k is
the number of processors.
A shared memory implementation of CLUSTALW on SGI multiprocessors was
described in [22] using OpenMP, with speedups of 10 on 16 processors reported;
a comparison of various implementations is presented in [13].
the search space by sampling the data, which results in favorable run times for large
sequences. FASTA and the widely used BLAST are two common tools that
adopt such heuristic approaches.
However, even the heuristic methods in sequential form are becoming increasingly
inadequate as database sizes grow, owing to the expansion in the number
of discovered sequences produced by high-volume, efficient sequencing technologies.
A possible way to speed up the heuristic alignment methods is to employ parallel
and distributed processing. This is typically achieved either by replicating the
database, if it is not too large, or by partitioning it. We described these two
approaches as implemented in various BLAST versions.
Sequence alignment is probably one of the most investigated and studied topic
in bioinformatics and the tools for this purpose are among the mostly publicly used
software in bioinformatics. There are books devoted solely to this topic as general
alignment or multiple sequence alignment, and this topic is treated in detail in many
contemporary bioinformatics books. Our approach in this chapter was to briefly
review the fundamental methods of alignment only, with emphasis on distributed
alignment.
Exercises
1. Work out the global alignment between the two DNA sequences below using the
dynamic programming approach of NW algorithm. Show all matrix iterations.
A T G G C T A G T A C C
G T G C T T G T A C C
2. Find the local alignment between the two protein sequences below using the SW
algorithm. Show all matrix iterations.
B N Q R S T U R V Y A C K
A N Q T T V T U R X E A C
3. For the following four DNA sequences, implement center star method of multiple
sequence alignment by first finding the distances between them and forming the
distance matrix D. Find the central sequence and align all of the sequences to
the central sequence using NW algorithm and finally insert gaps in sequences to
complete the alignment.
S1: A C C G A A C
S2: A G C G C T G
S3: C C C T A T G
S4: A T C G A T G
4. Compare FASTA and BLAST in terms of method used and the accuracy achieved.
132 6 Sequence Alignment
5. Given the following two DNA sequences, draw the general suffix tree for them
and find the LCS of these two sequences using this generalized suffix tree.
G T A C C T A A G T C A
A G T C T G A A C T G
References
1. Altschul SF, Gish W, Miller W, Myers EW et al (1990) Basic local alignment search tool. J
Mol Biol 215(3):403–410
2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids
Res 25(17):3389–3402
3. Bjornson R, Sherman A, Weston S, Willard N, Wing J (2002) Turboblast: a parallel imple-
mentation of blast based on the turbohub process integration architecture. In: IPDPS 2002
Workshops
4. Boukerche A (2006) Computational molecular biology. In: Albert YZ (ed) Parallel computing
for bioinformatics and computational biology models, enabling technologies, and case studies.
Wiley series on parallel and distributed computing, Chap, 6
5. Brudno M, Do C, Cooper G, Kim M, Davydov E, Green ED, Sidow A, Batzoglou S (2003)
LAGAN and multi-LAGAN: efficient tools for large-scale multiple alignment of genomic
DNA. Genome Res 13:721–731
6. Brudno M, Chapman M, Gottgens B, Batzoglou S, Morgenstern B (2003) Fast and sensitive
multiple alignment of large genomic sequences. BMC Bioinform 4:66
7. Chaudhary V, Liu F, Matta V, Yang LT (2006) Parallel implementations of local sequence
alignment: hardware and software. In: Albert YZ (ed) Parallel computing for bioinformatics
and computational biology models, enabling technologies, and case studies. Wiley series on
parallel and distributed computing, Chap, 10
8. Cheetham JJ et al (2003) Parallel CLUSTALW for PC clusters, computational science and its
applications. In: ICCSA 2003, pp 300–309. LNCS. Springer
9. Cristianini N, Hahn MW (2006) Introduction to computational genomics, a case studies ap-
proach. Cambridge University Press, Cambridge
10. Darling A, Carey L, Feng W-C (2003) The design, implementation, and evaluation of mpi-
BLAST. In: Cluster world conference and expo and the 4th international conference on linux
clusters: the HPC revolution, San Jose, CA
11. Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins.
Atlas Protein Seq Struct 5(3):345–352
12. Delcher AL, Phillippy A, Carlton J, Salzberg SL (2002) Fast algorithms for large-scale Genome
alignment and comparison. Nucleic Acids Res 30(11):2478–2483
13. Duzlevski O (2002) SMP version of ClustalW 1.82, unpublished. http://bioinfo.pbi.nrc.ca/
clustalw-smp/
References 133
14. Galper AR, Brutlag DR (1990) Parallel similarity search and alignment with the dynamic
programming method. Technical report KSL 90–74, Stanford University
15. Grant J, Dunbrack R, Manion F, Ochs M (2002) BeoBLAST: distributed BLAST and PSI-
BLAST on a Beowulf cluster. Bioinformatics 18(5):765–766
16. Gropp W, Lusk E, Skjellum A (2014) Using MPI: portable parallel programming with the
message passing interface, 3rd edn. MIT Press, ISBN: 9780262527392
17. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc
Nat Acad Sci USA 89(22):10915–10919
18. Hohl M, Kurtz S, Ohlebusch E (2002) Efficient multiple genome alignment. Bioinformatics
18:312–320
19. Janaki C, Joshi RR (2003) Accelerating comparative genomics using parallel computing. Silico
Biol 3(4):429–440
20. Li Kuo-Ben (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing.
Bioinformatics 19(12):1585–1586
21. Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science
227(4693):1435–1441
22. Mikhailov D, Cofer H, Gomperts R (2001) Performance optimization of Clustal W: parallel
Clustal W, HT Clustal, and MULTICLUSTAL. White papers, Silicon Graphics, Mountain
View, CA
23. Mount DM (2004) Bioinformatics: sequence and genome analysis, 2nd edn. Cold Spring Harbor
Laboratory Press, Cold Spring Harbor. ISBN 0-87969-608-7
24. Naruse A, Nishinomiya N (2002) Hi-per BLAST: high performance BLAST on PC cluster
system. Genome Inform 13:254–255
25. National Center for Biotechnology Information, NCBIBLAST. http://www.ncbi.nih.gov/
BLAST/
26. Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Method
Enzymol 183:63–98
27. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Nat
Acad Sci USA 85:2444–2448
28. Rognes T, SeeBerg E (2000) Six-fold speedup of Smith Waterman sequence database searches
using parallel processing on common microprocessors. Bioinformatics 16(8):699–706
29. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phy-
logenetic trees. Mol BioI Evol 4(4):406–425
30. Sharapov I (2001) Computational applications for life sciences on Sun platforms: performance
overview, Whitepaper
31. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol
147:195–197
32. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTALW: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, positions-specific gap
penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
33. Wing-Kin S (2009) Algorithms in bioinformatics: a practical introduction. CRC Press (Taylor
& Francis Group), Chap, 5
34. WU blast http://blast.wustl.edu/blast/README.html. Washington University School of Medi-
cine
35. Wu X, Tseng C-W (2006) Searching sequence databases using high-performance BLASTs.
In: Albert YZ (ed) Parallel computing for bioinformatics and computational biology models,
enabling technologies, and case Studies. Wiley series on parallel and distributed computing,
Chap, 9
Clustering of Biological Sequences
7
7.1 Introduction
Clustering is the process of grouping objects based on some similarity measure. The
aim of any clustering method therefore is that the objects belonging to a cluster should
be more similar to each other than to the rest of the objects analyzed. Clustering is
one of the most studied topics in computer science as it has numerous applications
such as in bioinformatics, data mining, image processing, and complex networks
such as social networks. Recent technologies provide vast amounts of biological
data and clustering of this data to obtain meaningful groups is one of the fundamen-
tal research areas in bioinformatics. Clustering of biological sequences commonly
involve grouping of three types of data; the genome, protein amino acid sequences,
and expressed sequence tags (ESTs) which are small segments from complemen-
tary DNA (cDNA) sequences [1]. ESTs are used for gene structure prediction and
discovery, and a cDNA is formed by transcription from the mRNA.
We will make a distinction between clustering data points which we will call
data clustering, and clustering objects which are represented as vertices of a graph
in which case we will use the term graph clustering. Graph clustering methods for
biological networks will be reviewed in Chap. 11, and our emphasis in this chapter
is data clustering as applied to biological sequences.
The two classical clustering methods of data are the hierarchical clustering and
partitional clustering. Hierarchical clustering algorithms iteratively build clusters
either starting with each data point as a cluster and merge them to form larger clusters
in each step, or start with whole data as one cluster and iteratively divide the clusters
to form smaller clusters. The first method is called agglomerative and the second is
the divisive hierarchical clustering. The partitional clustering methods on the other
hand, attempt to directly group data points, and start with some initial clustering to
be modified at each step to yield better clusters.
Since our aim is to group similar DNA/RNA protein sequences, their similarities
can be evaluated using the sequence alignment methods and tools such as BLAST in
the first step of clustering. The edit distance which is the minimum number of indel
© Springer International Publishing Switzerland 2015 135
K. Erciyes, Distributed and Sequential Algorithms for Bioinformatics,
Computational Biology 23, DOI 10.1007/978-3-319-24966-7_7
136 7 Clustering of Biological Sequences
and substitution operations between two sequences can then be found and the hier-
archical or partitional methods can be employed to find clustering of the biological
sequences. Nevertheless, there are methods specifically tailored and optimized for
biological sequences other than these fundamental approaches. These methods can
be broadly categorized as sequence alignment-based approaches and alignment-free
approaches.
In this chapter, we first describe the clustering problem formally by reviewing
basic similarity and validation parameters. We will then inspect the two main methods
first and then investigate the clustering algorithms aiming at biological sequences.
The parallel and distributed algorithms for various approaches will be discussed to
conclude the chapter.
7.2 Analysis
Our aim in clustering is the grouping of similar objects. We therefore need to asses
the similarities or dissimilarities between the data points under consideration. The
methods to accomplish this assessment can be grouped under distance measures and
similarity measures.
Distance measures show the dissimilarity between two data points; the more distant
they are, naturally, the more dissimilar they are. The Minskowski distance between
two data points pi = pi1 , ..., pir and pj = pj1 , ..., xjr each having r dimensions is
defined as below [11].
r 1/m
d(pi , pj ) = (d(pik , pjk ) )
m
(7.1)
k=1
When m = 2, this distance is called the Euclidian distance as widely used in
practice. The weighted Euclidian distance is defined as:
r
d(pi , pj , W ) = (wk · d(pik , pjk ))2 (7.2)
k=1
where wj is the weight assigned to the jth component. The cosine similarity is defined
as the vector dot product of two vectors x and y characterizing the two data points
as follows: x· y
s(x, y) = (7.3)
x y
7.2 Analysis 137
Once we have the clusters formed, we need to check the quality of the groups formed.
Our aim in any clustering method is to group similar objects into the same cluster
which should be well separated from other clusters. Validation of a clustering method
therefore involves checking how similar the objects in a cluster are and how dissimilar
they are to the members of other clusters. Statistical methods can be employed to find
mean, variance, and standard error of the members of a cluster to find intra-cluster
similarity. Separation of clusters can be evaluated by computing their distances and
the similarity/separation ratio may display the quality of the clustering method [15].
Moreover, this ratio of partitioning can be compared with the ratio of a random
partitioning method to validate its performance.
Formally, the internal cluster dissimilarity parameter ISCk of a cluster Ck can be
evaluated by computing the average distance between its members to its centroid ck
as follows:
1
ISCk = d(pi , ck ) (7.4)
nk
pi ∈Ck
where N is the total number of clusters. The inter-cluster distance ESCp ,Cq between
two clusters Cp and Cq can be approximated by the distance between their centroids
cp , cq as:
ES(Cp ,Cq ) = d(cp , cq ) (7.6)
and the average value of this parameter for the whole data is:
1
ESC = np nq ECp ECq (7.7)
p= np nq i∈Ck
Hierarchical clustering is a widely used method for clustering objects due to its
simplicity and acceptable performance for data of medium size. This method of
clustering can be classified as agglomerative or divisive. In agglomerative clustering,
each data point is considered as a cluster initially. The two clusters that are closest to
each other using some metric are then merged to form a new cluster at each step. This
process is repeated until a single cluster is obtained. For individual points, Euclidian
distance suffices to determine the smallest distances, and in the general case, distance
between two clusters Ci and Cj can be defined as follows:
• Single link: The distance d(Ci , Cj ) is the distance between the two closest points
x ∈ Ci and y ∈ Cj .
• Complete link: d(Ci , Cj ) is between the two farthest points in Ci and Cj .
• Average link: The distance is calculated by finding the average of the distances
between every pair of points in Ci and Cj .
Figure 7.1 displays these measures. The divisive algorithms on the other hand
start with one cluster that contains all of the objects and then divide the clusters
iteratively into smaller ones. The output of a hierarchical clustering is a structure
called a dendogram which is a tree showing the clusters merged at each step and the
dendogram can be divided into the required number of clusters by a horizontal line.
(a) (b)
6
3
p2
p1 1
A
p8 p9
p5 4
B
5 p6
2
p4
p7
p3 p1 p8 p5 p4 p7 p6 p2 p9 p3
Fig. 7.2 A single-link agglomerative hierarchical clustering example. The steps of the algorithm
in (a) is shown by italic numbers outside the clusters. The output dendogram is shown in (b) is
divided into clusters {p1 , p8 , p5 }, {p4 , p7 , p6 , p2 , p9 }, and {p3 } by horizontal line A; and {p1 , p8 },
{p5 }, {p4 , p7 , p6 }, {p2 , p9 }, and {p3 } by B
since the formed clusters will eventually be separate groups of data points. There
is not a single method of selecting initial centroids that works for all input data
combinations as reported in [17].
3. Definition of the objective function: As the objective function, we need to ensure
the clusters obtained are stable, in other words, their centroids remain almost
constant. A widely used parameter for the objective function is the sum of squared
errors (SSE) defined as follows:
N
SSE(C ) = d(pi , ck )2 (7.8)
j=1 pi ∈Ck
where Ck is the cluster k and ck is the centroid of this cluster. The SSE parameter
displays the sum of the square of distances of each data point to its centroid,
accumulated over all clusters. The centroids can be selected using this criteria
and the objective function can then be stated as to minimize the SSE parameter.
Figure 7.3 shows the execution of the k-means algorithm in a small dataset. In
order to obtain k clusters of n data objects with each having r dimensions, the k-
means algorithm runs in O(nmkr) time where m is the number of iterations. Typically,
all of these parameters are much smaller than data size n providing several steps of
convergence time. The main advantage of k-means algorithm is that it has low time
complexity and the convergence is obtained in several number of steps in practice. It
has some limitations though; it works fine when the size of the clusters are similar,
and their densities are in comparable ranges. It may not produce the right clusters if
these conditions are not valid or the data contains outliers and noise. Also, another
disadvantage of this algorithm is that the initial placement of the centroids affect
the quality of clustering obtained and therefore additional sophisticated methods are
needed for this purpose. However, k-means algorithm and its derivatives have found
wide range of applications including bioinformatics due to their simplicities and low
running times.
(a)
(b)
Fig. 7.3 k-means algorithm example, a There are 14 data points p1 , ..., p14 and the initial centroids
for 3 clusters are X1, X2, and X3. Each data point is assigned to its nearest centroid resulting in
clusters C1, C2, and C3. b The new centroids are computed and each data point is assigned to its
nearest centroid again
C2
C1
C3
implemented as the average distance to the centrotype or the sum of distances to it.
Algorithm 7.3 shows the pseudocode for this algorithm which has a similar structure
to the k-means algorithm. The k-modes algorithm is very similar to the k-medoids
algorithm which can efficiently determine clusters of categorical data.
Some datasets have nonspherical shapes, and clusters in these data cannot be easily
discovered by the hierarchical or partitional algorithms as these assume data obey a
probability distribution. Density-based clustering algorithms aim to discover clusters
of such arbitrary-shaped data. Density-based spatial clustering of applications with
noise (DBSCAN) is one such algorithm where density inside a fixed radius area
around a data point is considered [8]. A core point p has at least MinPts number
of data points within the radius of Eps from itself. A point q within the Eps radius
of a point p is called density-reachable from p. If there is another point that is both
density-reachable from both p and q; p and q are said to be density-connected. In
DBSCAN, clusters are formed with density-connected points. Noise data points are
not contained in any cluster and border data points are included in clusters but they
do not have dense regions around them, that is, they are not core points. The irregular
clusters obtained by a density-based clustering algorithm are shown in Fig. 7.4. In
grid-based clustering, the data space is partitioned into a number of grid cells. Data
densities in each of these cells is evaluated and neighboring cells with high densities
are combined to form clusters.
144 7 Clustering of Biological Sequences
Our aim in the clustering of biological sequences is to arrange them in groups to asses
their functions and evaluate phylogenetic relationships between them which can be
used for many applications such as understanding diseases and design of therapies.
As stated before, the edit distance between two sequences is the number of indels
(insertions-deletions) and substitutions to transform one to another. For example,
two DNA sequences S1 and S2 and their alignment are given below:
S1: A T T C G T T G G A C C S1: A T T C G T T G G A C C
S2: G T T C T T A G A C S2: G T T C - T T A G A - -
m i m i i
There are two substitutions and 3 indels resulting in an edit distance of 5 between
these two sequences. The lengths of the sequences to be aligned need not be the same
as in this example. For global alignment where we searched for the alignment of two
or more whole sequences, we can use the dynamic programming approach of the
Smith–Waterman algorithm of Sect. 6.4 with time complexity of O(nm), where n and
m are the lengths of the two sequences. For the clustering of biological sequences,
using global alignment tools seems as the right choice at first glance as we are inter-
ested in the clustering of the whole sequences. However, proteins which are chains
of amino acids consist of groups of these molecules called domains. It was found that
two globally very much different proteins may have similar domains. Surprisingly,
these proteins may have similar functions [4]. Global alignment of such proteins
would result in high edit distances, and they would be placed in different clusters
although they should be included in the same cluster as they have similar functions.
Under these circumstances, local alignment algorithms such as Needleman–Wunsch
algorithm or tools like FASTA and BLAST should be used to find the distances
between the sequences. All of these alignment methods, whether global or local,
provide the edit distances between the sequences and once these are obtained, we
can use any of the well-known clustering methods that consider distances between
data points such as hierarchical or partitional algorithms. Although these classical
algorithms have polynomial time complexities, the biological sequences are huge
making even these polynomial times unacceptable in practice. For this reason, sig-
nificant research has been oriented toward clustering aiming at biological sequences
making use of some property of these sequences.
Then, a vector VS which has an element vi showing the frequency of the segment
si ∈ S is formed. The similarity search process between two segments S1 and S2
represented by vectors V1 and V2 can then be performed by vector operations such as
dot product or Euclidian distance calculation [4]. The subsequence-based clustering
is another method that aims biological sequences. The frequent subsequences called
motifs, as we will describe in the next chapter, are searched in a sequence and once
these are discovered, sequences with similar motifs can be clustered.
The general idea of the graph-based sequence clustering method is to represent the
sequences as nodes of a graph G. The edges of the graph represent the similarities
between the sequences. The similarities or dissimilarities in the form of distances
can be obtained using a sequence alignment method output of which is the distance
matrix D. We can then check this matrix for distance d(u, v) between each pair of
sequences (u, v) in O(n2 ) time, and if d(u, v) is less than a defined threshold τ , edge
(u, v) is added to the edges of the graph G. Once graph G is formed, clustering of
sequences is reduced to dividing G into locally dense regions which are the sequences
with high similarities. Algorithm 7.4 displays the pseudocode for this algorithm.
Figure 7.5 displays a small sample graph formed and the clustering of this graph
into dense local regions. There are many graph clustering algorithms as we will
investigate in Chap. 11.
146 7 Clustering of Biological Sequences
The parallel and distributed algorithms for clustering large biological sequences can
be broadly classified as memory-based algorithms and disk-based algorithms [4]. If
data can be accommodated in the memories of the distributed memory computers,
memory-based algorithms can be used. Otherwise, data on multiple disks should
be managed. MapReduce is a software environment which handles disk resident
data in parallel [6]. The distributed algorithms for biological sequence clustering in
general follow the data partitioning approach. The distances between n sequences are
initially distributed by a supervisor process to k processes each of which performs
local clustering of sequences it is assigned. The results are then typically sent to
the supervisor which merges the clusters and do some postprocessing. It may also
send the partial results to the worker processes for further refinement. Algorithm
7.5 displays the pseudocode for a generic distributed algorithm which works as
described.
The hierarchical clustering algorithms provide favorable results superior to the results
obtained from k-means algorithm, and also they have the advantage that the number
of clusters and the initial centroid locations is not required beforehand. However, the
fact that they have quadratic time complexity renders their use in biological sequences
which have large data sizes. Therefore, parallelization of hierarchical algorithms has
been a pursued topic of study for many researchers. However, the main difficulty
encountered is the need for global data access in these algorithms. We need to find
globally, out of all existing clusters, the closest clusters in each step which means
we need to check each cluster-to-cluster distance.
7.5 Distributed Clustering 147
The distance matrix D is already formed using a sequence alignment tool like
BLAST. This matrix as row-partitioned between these processes is given as below.
In the first step, the global minimum distance is 1 between the sequences S1 and S4 .
These two sequences are clustered and the distances to this new cluster are com-
puted as shown in Fig. 7.6. This processs continues until there is one cluster that
contains all of the clusters as shown in Figs. 7.7 and 7.8 and the final tree constructed
is shown in Fig. 7.9.
7.5 Distributed Clustering 149
S1 S2 S3 S4 S5 S6
(S1 , S4 ) S2 S3 S5 S6
S1 0 12 15 1 9 5 p0
(S1 , S4 ) 0 5 10 7 5 p0
S2 12 0 4 5 8 14
S2 5 0 4 8 14
S3 15 4 0 10 2 11 p1 →
S3 10 4 0 2 11 p1
S4 1 5 10 0 7 13
S5 7 8 2 0 6 p2
S5 9 8 2 7 0 6 p2
S6 5 14 11 6 0
S6 5 14 11 13 6 0
Fig. 7.6 Distributed AHC algorithm first and second rounds. The minimum distance is between
sequences S1 and S4 which are clustered to form the matrix on the right. The minimum distance in
the right matrix is between S3 and S5 which are clustered
(S1 , S4 ) S2 (S3 , S5 ) S6
(S1 , S4 ) (S2 , S3 , S5 ) (S6 )
(S1 , S4 ) 0 5 7 5 p0
(S1 , S4 ) 0 5 5 p0
S2 5 0 4 14 →
(S2 , S3 , S5 ) 5 0 6 p1
(S3 , S5 ) 7 4 0 6 p1
S6 5 6 0 p2
S6 5 14 6 0 p2
Fig. 7.7 Distributed AHC algorithm third and fourth rounds. The minimum distance in the left
matrix above is between clusters (S2 ) and (S3 , S5 ) which are clustered. The minimum distance now
is selected arbitrarily to be between clusters (S1 , S4 ) and (S6 ) as shown in the right matrix
(S1 , S4 , S6 ) (S2 , S3 , S5 )
(S1 , S4 , S6 ) 0 5 p0
(S2 , S3 , S5 ) 5 0 p1
Fig. 7.8 Distributed AHC algorithm last iteration. The minimum distance now is between clusters
(S1 , S4 , S6 ) and (S2 , S3 , S5 ) which are clustered to form the last cluster which contains all of the
clusters
Analysis
There are k processes as p0 , ..., pk−1 , and each process pi has n/k rows to work with,
each pi finds the smallest distance in its group in O(n2 /k) time. There are n − 1
rounds of the distributed algorithm, and the total time taken is therefore O(n3 /k).
The sequential algorithm had O(n3 ) time complexity, hence, the speedup obtained
is the processor number k ignoring the communication overheads.
(c)
(b)
(a)
(e)
(d)
Fig. 7.9 The partial trees constructed with the distributed algorithm in each round. The trees in
(a),…, (e) correspond to the rounds 1,…, 5. The tree in (e) is final (Not to the scale)
matrix D with an entry dij showing the distance between sequences i and j as before.
We will repeat this matrix here as follows.
S1 S2 S3 S4 S5 S6
S1 0 12 15 1 9 5
S2 12 0 4 5 8 14
S3 15 4 0 10 2 11
S4 1 5 10 0 7 13
S5 9 8 2 7 0 6
S6 3 14 11 13 6 0
The first step of the proposed parallel method involves constructing a full con-
nected graph K6 using these sequences as vertices and the edge weights denoting
their distances in D. We then search for an MST of this graph using a suitable algo-
rithm. The three algorithms to build an MST of a weighted graph are the Boruvka’s,
Prim’s, and Kruskal’s algorithms. Boruvka’s algorithm tests each node of the graph
and works by always selecting the lightest edges. Let us use Kruskal’s algorithm and
7.5 Distributed Clustering 151
sort edges in increasing weights. We include edges in MST starting from the lightest
edge as long as they do not create any cycles with the existing edges of the partial
MST and continue this process for (n − 1) times. The MST obtained from K6 is
shown in Fig. 7.10.
The output dendogram using the single-link agglomerative hierarchical clustering
algorithm for the same dataset is displayed in Fig. 7.11. We can see that these are
equivalent. In fact, the hierarchical clustering algorithm and the Kruskal’s algorithms
work using a similar principle, we always combine the closest pairs of clusters in the
clustering algorithm and we choose the lightest weight edge in the MST algorithm.
Cycle creating edges are discarded in the MST algorithm, similarly, two existing
nodes of a cluster are never processed in the clustering algorithm. The construction
of the MST dominates the time taken for the clustering algorithm and hence, we will
look at ways of parallelizing this step.
MST Construction
Considering the three algorithms to build MST which are Prim’s, Kruskal’s, and
Boruvka’s algorithms, it can be seen at first glance that Prim’s and Kruskal’s algo-
rithms process only one edge at a time and therefore are not suitable for independent
operations on processors, hence are difficult to parallelize. Boruvka’s algorithm on
the other hand provides independent operations on vertices and hence can be paral-
lelized more conveniently. For this reason, we will elaborate on Boruvka’s algorithm
which consists of the following steps:
1. Find the lightest weight edge (u, v) from each vertex u in G. The edge (u, v) is
included in MST.
2. Combine u and v to have a component which contains vertices u and v and the
edge (u, v) between them.
3. Combine components with their neighbor components using the lightest edge
(u, v) between them. The edge (u, v) is included in MST.
4. Repeat Step 3 until there is only one component.
Combining the components is called the contraction process and the edges used
in contraction constitute the MST. The operation of this algorithm is depicted in
Fig. 7.12 where the MST is formed only in two steps. The MST is unique as the edge
weights are distinct, and Kruskal’s or Prim’s algorithm will provide the same MST.
We can form a simple distributed version of this algorithm by distributing the
partitions of graph G to p processors each of which performs Boruvka’s algorithm
on its subgraph until they all have one component. They can then exchange messages
to perform contraction between them.
(a) (b)
2 11 4 12 2 11 4 12
a b c d e a b c d e
3 3
5 12 6 1 13 7 5 12 6 1 13 7
10 8 9 10 8 9
j i h g f j i h g f
Fig. 7.12 Execution of Boruvka’s algorithm in a sample graph. a The first step. b The second step.
Bold lines are included in the MST
7.5 Distributed Clustering 153
Algorithm 7.7 shows the operation of the distributed k-means using k processes
with p0 serving as the supervisor. Different than the generic algorithm, the supervisor
itself is also involved in the clustering process. In the initialization phase, p0 assigns
the initial m centroids c1 , ..., cm to the set M and sends the partitions rows along
with M to each worker process. Each process pi = p0 then works out the clusters
of the data points in its rows by computing their distances to the centroids, and
sends the partial cluster sets to the root at the end of the round, the root merges the
partial clusters from all workers to find the full clusters, calculates the new centroids
and sends these to the workers for the next round. This process continues until the
centroids are stable in which case p0 sends stop message to notify workers that the
process is over. It should be noted that data points are sent once in the first round
and only the newly calculated centroids are transferred in each round. Each process
should then assign the data points in its rows to the clusters based on these centroids
and send the partial clusters to the root at the end of a round. This process continues
until an objective function is met.
The similarity graph for a set of biological sequences can be constructed by connect-
ing two similar sequences by an edge. Our aim is now to find dense regions in this
graph to build clusters. There exists various algorithms for this purpose as we will
see in Chap. 11 when we investigate clustering in biological networks. For now, we
need to find a method such that any sequential graph algorithm can be performed in
parallel if we can partition the graph among the processes of the distributed comput-
ing system. There are numerous algorithms for this purpose, for example, we may
require to divide the graph into partitions such that the number of vertices in each
partition is approximately equal. We may have a different objective such as to have
a minimum number of edges between the partitions or minimum sum of weights of
edges if the graph is weighted. In this case, our aim is basically to seek clusters of
the graph. We can even pursue both of these requirements, that is, partitions should
have similar number of vertices with few edges between the partitions. If our aim
is simply load balancing to provide each process with nearly equal workload, then
partitioning for similar number of vertices in each partition would be sensible, es-
pecially if the clustering algorithm employed has a time complexity dependent on
the vertex number rather than edges, as frequently encountered in practice. We will
describe a simple method for this purpose next.
edges and also edges between the vertices of adjacent levels called inter-level edges
which are not included in the tree as shown in Fig. 7.13. As we iteratively run the
BFS algorithm, we can record the total number of nodes s searched up to that point,
and when s ≥ n/2, we place all of the nodes discovered so far in partition 1 and the
rest in partition 2. This will require a minor modification to the BFS algorithm and
will provide almost equal-sized partitions. As the line that represents this partition
will be just between the adjacent levels and not cross horizontal edges, we will have
fewer edges between the two partitions and hence satisfy the second requirement of
partitioning. Figure. 7.13 displays such a partitioning.
The time complexity of this algorithm is O(n + m) as the ordinary BFS algorithm.
As we frequently need to partition G into k partitions called k-way partitioning, we
need to run this algorithm recursively for each partition such that at log k steps, we
will achieve the required partitions. The total complexity will then be O(m log k)
assuming m ≥ n in a connected graph.
There are various available algorithms and software for the parallel clustering of data,
some aiming the biological sequences. These algorithms can be broadly categorized
as memory-based and disk-based as was noted before.
Density-based distributed clustering (DBDC) [3] algorithm aims to discover
arbitrary-shaped clusters. Data points are partitioned and sent to a number of proces-
sors which find the local clusters and send these back to the supervisor process which
uses density-based spatial clustering of applications with noise (DBSCAN) [8] to
combine the local clusters into global ones. The supervisor then sends the global
clusters to the workers for postprocessing. This process is very similar to the proce-
dure described in our generic distributed clustering algorithm of Algorithm 7.5. The
speedup obtained by DBDC is favorable with good quality results.
0 1 2 3 4
a b c d e
1 2 3 4
1 f g h i j inter-level edge
3 4 4
l m n
2 k
horizontal edge
Fig. 7.13 BFS-based partitioning of a graph. The tree rooted at vertex a have edges shown by bold
arrows, levels of vertices are shown next to them, and the dividing line is between levels 2 and 3
providing equal partitions of size 7
156 7 Clustering of Biological Sequences
For clustering biological sequences, the edit distance output by sequence align-
ment algorithms would provide us the similarities between the sequences. We can
therefore use any classical method with these calculated distances. However, biolog-
ical sequences have large sizes making it difficult to implement even the polynomial
time classical algorithms for this purpose. There are a number of clustering methods
targeting the biological sequences. Keyword-based clustering divides the sequence
into fixed-size segments and forms a vector with elements showing the frequencies
of these sequences. Various vector comparisons can then be employed to find the
similarities of the sequences and cluster them. Subsequence clustering attempts to
find common subsequences among the input and tries to cluster them using these
subsequences. A graph with nodes representing sequences and edges denoting their
similarities can be built and clustering is then reduced to finding dense regions of this
graph as these indicate closely related sequences. This process is called graph-based
clustering, and there are various algorithms to detect these dense regions as we will
see in Chap. 11 when we investigate clustering in biological networks.
In the final part of the chapter, we reviewed parallel and distributed clustering algo-
rithms for classical data clustering and biological sequence clustering. We proposed
a distributed single-link hierarchical clustering algorithm in which the distances be-
tween the clusters are computed in parallel. The second approach of this clustering
method is based on constructing the MST of the complete graph with edges showing
the distances. The popular k-means algorithm can be easily parallelized on distributed
memory computers as pursued in various research studies. Providing a distributed
version of graph-based clustering of biological sequences is more difficult due to
the dependencies of subtasks involved. More research is needed in this area, namely,
distributed graph-based clustering of biological sequences.
Exercises
5. The distance matrix for 6 data points is given as below. Assuming three processes
are available, work out clusters formed using the distributed AHC algorithm
executed in parallel by the three processes. Show every iteration of the distributed
AHC algorithm.
7.6 Chapter Notes 159
i h g f
S1 S2 S3 S4 S5 S6
S1 0 4 1 8 3 11
S2 4 0 11 14 5 6
S3 1 11 0 9 10 7
S4 8 4 9 0 13 4
S5 3 5 10 13 0 2
S6 11 6 7 4 2 0
6. Work out the two partitions formed by the BFS-based graph partitioning algorithm
of Sect. 7.5.3 for the graph shown in Fig. 7.16 for the source vertex a.
References
1. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR,
Wu A, Olde B, Moreno RF, Kerlavage AR, McConbie WR, Venter JC (1991) Complementary
DNA sequencing: expressed sequence tags and human genome project. Science 252:1651–1656
2. Asyali MH, Colak D, Demirkaya O, Inan MS (2006) Gene expression profile classification: a
review. Curr Bioinform 1:55–73
3. Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream
with noise. In: SDM Conference, pp 326–337
4. Charu C Aggarwal, Chandan K Reddy (ed) (2014) Data clustering algorithms and applications.
CRC Press, Taylor and Francis
5. Cordeiro RLF, Traina Jr C, Traina AJM, Lopez J, Kang U, Faloutsos C (2011) Clustering very
large multi-dimensional datasets with MapReduce. In: Proceedings of KDD, pp 690–698
6. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proc
eedings of OSDI, pp 139–149
7. Erciyes K (2014) Complex networks: an algorithmic perspective, pp 145–147. CRC Press,
Taylor and Francis, ISBN 978-1-4471-5172-2
8. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters
in large spatial databases with noise. In: Proceedings of the second international conference on
knowledge discovery and data mining, KDD, pp 226–231
9. Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science
315:972–976
10. Gordon A (1996) Null models in cluster validation. In: Gaul W, Pfeifer D (eds) From data to
knowledge. Springer, New York, pp 32–44
160 7 Clustering of Biological Sequences
11. Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann Publishers,
San Francisco
12. Jin C, Patwary Md MA, Agrawal A, Hendrix W, Liao W-K, Choudhary A (2013) DiSC: a
distributed single-linkage hierarchical clustering algorithm using MapReduce. In: Proceedings
of SC workshops, the fourth international workshop on data intensive computing in the clouds
(DataCloud 2013)
13. Kerr G, Ruskin HJ, Crane M, Doolan P (2008) Techniques for clustering gene expression data.
Comput Biol Med 38:283–293
14. Ketchen DJ Jr, Shook CL (1996) The application of cluster analysis in strategic management
research: an analysis and critique. Strateg Manag J 17(6):441–458
15. MacCuish JD, MacCuish NE (2010) Clustering in bioinformatics and drug discovery. CRC
Press, Taylor and Francis
16. Mardia K et al (1979) Multivariate analysis. Academic Press, London
17. Meila M, Heckerman D (1998) An experimental comparison of several clustering methods. In
Microsoft research report MSR-TR-98-06, Redmond WA
18. Murtagh F (2002) Clustering in massive data sets. In: Handbook of massive data sets, pp
501–543
19. Olman V, Mao F, Wu H, Xu Y (2009) Parallel clustering algorithm for large data sets with
applications in bioinformatics. IEEE/ACM Trans Comput Biol Bioinform 6:344–352
20. Papadimitriou S, Sun J (2008) DisCo: Distributed co-clustering with Map-Reduce: a case study
towards petabyte-scale end-to-end mining. Proc ICDM 2008:512–521
21. Pham TD, Wells C, Crane DI (2006) Analysis of microarray gene expression data. Curr Bioin-
form 1:37–53
22. Wang M, Zhang W, Ding W, Dai D, Zhang H, Xie H, Chen L, Guo Y, Xie J (2014) Parallel
clustering algorithm for large-scale biological data sets. PLoS One. doi:10.1371/journal.pone.
0091315
23. Xu R, Wunsch D II (2005) Survey of clustering algorithms. IEEE Trans Neural Networks
16(3):645–678
24. Zhang J, Wu G, Hu X, Li S, Hao S (2013) A Parallel clustering algorithm with MPI-MKmeans.
J Comput 8(1):10–17
25. ZhaoW, Ma H, He Q (2009) Parallel k-means clustering based on MapReduce. In: Proceedings
of CloudCom 2009, pp 674–679
Sequence Repeats
8
8.1 Introduction
The repeat analysis of DNA/RNA or proteins involves finding the location and deter-
mining the characteristics of the repetitive subsequences in these cellular structures.
Given a number of DNA sequences, we can concatenate them to form a single long
sequence and then perform repeat analysis. In this form, the repeat analysis problem
is similar to the problem of sequence alignment since the repeats in such a sequence
are the matching parts of the individual sequences. Repeating nucleotide sequences
in human DNA occur frequently and various other organisms have also repeating
subsequences.
The repeating patterns in human genome are basically of two types; tandem re-
peats and dispersed repeats. Tandem repeats are repeating adjacent subsequences
such as GATAGATAGATA where the subsequence GATA is repeated 3 times consec-
utively. They usually reside in the non-coding region of DNA but also in genes and
there is a high rate of change in the number of some repeats. Tandem repeats of amino
acids in proteins also exist and these vary from a single amino acid to 100 or more
residues. Tandem repeats in DNA also have varying sizes, from 1 bp to a complete
gene which is possible. Short tandem repeats (STRs), also called microsatellites, are
typically between 1 and 6 bp long and are considered as genetical markers of indi-
viduals [9]. Finding tandem repeats has many implications; they signal the location
of genes and they are assumed to have some function in gene regulation. They can
be used for identification of an individual in forensics and the increased number of
tandem repeats is associated with a number of genetically inherited diseases such
as fragile-X mental retardation, central nervous system deficiency, myotonic dystro-
phy, Huntington’s disease, and also with diabetes and epilepsy [8]. They can also be
used to identify complex diseases such as cancer [23,24]. Additionally, they are also
needed in population genetic studies where the number of repeats is similar between
related individuals.
The dispersed repeats are noncontiguous and these patterns are commonly called
sequence motifs. Such a sequence motif is a nucleotide sequence in a DNA or RNA
structure, typically in genes, or amino acid sequence in a protein that is found sta-
tistically more frequent than any other sequences. For example, a sequence motif of
the sequence CGATCAGCTATCCCTATCGGA is ATC which is repeated 3 times.
Such motifs can be searched in two ways: over-represented sequence repeats in a
single DNA or protein; or conserved motifs in a set of orthologous DNA sequences.
A structural motif of a protein on the other hand is usually formed by a certain se-
quence of amino acids to result in a three-dimensional physical shape of the protein.
A sequence motif in a gene may encode for a structural motif in a protein.
DNA sequence motifs are usually present in transcription factor binding sites
(TFBS) where proteins such as transcription factors bind to these sites to regulate
the expression of genes. Hence, discovery of such motifs helps to understand the
mechanisms of gene expression [17,34,35]. Motif discovery or motif extraction is the
process of finding recurring patterns in biological data. Motif discovery algorithms
can be classified as statistical and combinatorial. Statistical algorithms are based on
the computation of the probability of a motif existence and although they provide
suboptimal solutions, they are usually faster than exact algorithms due to restricted
search space. Combinatorial motif discovery algorithms on the other hand are exact
algorithms in the general sense. A subclass of combinatorial algorithms based on
graph theory has become popular recently as we will investigate. If the motif to be
searched is known beforehand, the motif discovery problem in this form is reduced
to pattern matching where we searched all occurrences of a pattern P in a text T as
we reviewed in Chap. 5.
In the general case of repeat analysis, a number of mutations in repeats are per-
missable as observed in practice. In this case, the search is for approximate repeats,
whether tandem repeats or sequence motifs, rather than exact ones. Searching re-
peating patterns is more directed toward nucleotide sequence repeats as amino acid
sequence repeats in proteins are not as dynamic as the DNA/RNA sequences. Yet,
another point to consider is whether the search of a motif is targeted toward a single
long sequence or we need to search conserved motifs in a set of relatively short
sequences. In summary, we have various facets of the sequence repeat problem as
stated below:
Naturally, the search method employed can have a possible combination of some
of these characteristics. For example, our aim may be to find pattern discoveries of
structured and dispersed motifs in a set of sequences using probabilistic methods.
In this chapter, we will first describe methods for finding tandem repeats mainly in
DNA sequences. We will then investigate motif discovery problem and then describe
sequential and distributed algorithms with focus on graph-theoretic motif search.
Tandem repeats are observed in both the genome and proteins. A tandem repeat that is
repeated more than twice is sometimes called a tandem array [32], we will, however,
use the term repeat to mean both. A tandem repeat is called primitive if it does not
contain any other tandem repeats, otherwise it is non-primitive. For example, GTAC
is a primitive tandem repeat and TATA is a non-primitive tandem repeat as it contains
another repeating pattern of TA. A repeating pair (i 1 , j1 ), (i 2 , j2 ) of a given sequence
S = s1 . . . sn implies si1 . . . s j1 = si2 . . . s j2 for tandem repeats and sequence motifs
[31]. The length of such a repeated pair is j − i + 1. Given a sequence AGCC ATCA
ATCA ATCA TCGCC, the tandem repeat is ATCA with starting indexes 5, 9, and
13, of length 4 and it is repeated 3 times.
Tandem repeats of DNA may be classified as perfect (exact) repeats or imperfect
(approximate) repeats in which the structure of tandem repeats may have been altered
due to mutations. The distance between the repeating sequences can be determined
either by the Hamming distance which is the number of mismatches between the
repeats, or the edit distance which is the minimum number of indels and substitutions
to transform one sequence to another, or by a scoring matrix which provides weighted
edit distances [4]. Tandem repeats can be classified as microsatellites (<10 bp),
minisatellites (between 10 and 100 bp), and satellites (>100 bp). Microsatellites
tend to be perfect, whereas minisatellites and satellites in general are approximate
repeats [13]. A short tandem repeat (STR) is a DNA sequence repeat of length
between 5 and 30 base pairs. The STRs in the Y chromosome of an individual are
164 8 Sequence Repeats
passed from father to son almost unchanged. STRs are commonly used in forensics
and genealogical DNA testing to discover identity of individuals. Tandem repeats
can be classified as follows [15].
1. Exact tandem repeats: These are in the form SS . . . S, for example, GCAT GCAT
GCAT.
2. k-approximate tandem repeats: These consist of repeats as S1 . . . Sn where each
Si has a distance (edit, hamming, or other) of at most k to a consensus repeat S.
For example, CAGT CAGG GAGC is a 1-approximate tandem repeat of three
sequences with the consensus sequence CAGT.
3. Multiple length tandem repeats: A multiple length tandem repeat is shown as
(Ss n )m where edit distance between S and s is greater than a threshold value t.
As an example, (CATTTG GACT GACT) (CATTTG GACT GACT) is a repeat
of this kind with n = 2 and m = 2. These repeats can also be approximate.
4. Nested tandem repeats: Considering two sequences S1 and S2 which have an
j
edit distance larger than t, these repeats are of the form S2i S1 S2 S1 . . . S1 S2z where
i, j, . . . , z > 1. S1 is the tandem repeat and S2 is termed the interspersed repeat.
The approximate values of these repeats are also possible.
Stoye and Gusfield presented simple time and space optimal algorithms using suffix
trees to locate tandem repeats (and arrays) in a sequence [32]. The algorithms are
based on the property of suffix trees that allow the efficient location of the branching
occurrences of tandem repeats. We will briefly review only the basic algorithm here
as in [32]. Given a sequence S, a tandem array shown as w = α k of S can be specified
8.2 Tandem Repeats 165
Fig. 8.1 Branching and non-branching repeats. When b = a, the tandem repeat is non-branching,
otherwise it is branching
as (i, α, k), meaning it starts at location i and is repeated k times. A tandem array is
right-maximal if there is no other repeat of α immediately after w and left-maximal
if there is no preceding α preceding w. A tandem array with two repeats is termed
a tandem repeat in this study and a tandem repeat is called branching when the
adjacent character proceeding the repeat is different than the first character of the
repeat, otherwise it is called a non-branching repeat as shown in Fig. 8.1.
Every tandem repeat is either branching or is included in a chain of tandem
repeats obtained by successive rotations starting from a branching tandem repeat.
The algorithm presented by the authors makes use of these concepts by first detecting
tandem repeats and then obtaining the required non-branching repeats from these
branching repeats. Suffix trees can be used to simplify this process; let T (S) be the
suffix tree of the sequence S and L(v) the path-label of a node v in T which is the
concatenation of the edge labels in the path from the root to the node v. The string-
depth of a node v is D(v) = |L(v)| and a leaf v is labeled with index i if and only if
L(v) = S[i..n]. For an internal node v, the leaf list L L(v) is defined as the list of the
leaf labels below v. Algorithm 8.1 shows the pesudocode of this algorithm.
The leaf list of v can be found by a linear traversal of the subtree rooted at v in a
time proportional to the size of L L(v). A depth-first-search traversal of T from the
root is performed and successive numbers called DFS numbers are assigned to the
leaves in their order of occurrences during traversal. These DFS numbers are stored
in an array indexed by the original leaf numbers. The next number to be given to a
leaf a is also stored at an internal node v during the first visit of v in a DFS traversal.
On return from the traversal through v, the most recent DFS number b assigned is
also stored at v. Using this approach, leaves of L L(v) all have numbers between a
and b.
The time for this algorithm is proportional to the total size of all the leaf lists which
is O(n 2 ) and a simple modification provides O(n log n) time bound [32]. The idea
of this algorithm was used to find tandem repeats using suffix arrays in [1] resulting
in less memory usage.
Sequence motifs are recurring DNA/RNA nucleotide or protein amino acid patterns
that have some biological significance. These motifs may represent binding sites for
DNA and binding domains for proteins. Discovering motifs in biological sequences
therefore helps to understand the biological functions such as transcriptional regula-
tion, mRNA splicing, and protein complex formation. It also provides an alternative
method to find the phylogenetic relationships between organisms by discovering
conserved motifs in them.
Finding binding sites in DNA is crucial in understanding the mechanisms that
regulate gene expression. Based on their functional importance, there are many re-
search studies in finding these sequence motifs resulting in numerous algorithms.
Frequently, the motif sequences undergo changes resulting in variations of the orig-
inal motif due to mutations. Hence, as in the case of tandem repeats, we can have
exact or approximate motif search. Furthermore, the DNA motifs may be simple or
8.3 Sequence Motifs 167
structured. A structured motif consists of a central pattern and one or two motifs
called satellites placed at some distance to the right and left of the central pattern
[16]. The central motifs and satellites are searched using different methods.The motif
finding problem has few variants described below:
1. Planted(l, d) Motif Search (PMS): Given are a set of n sequences S = {S1 , . . . , Sn }
over an alphabet , with m being the approximate length of all strings; and two
integers l and d, 0 ≤ d < l < m; a substring M ∈ Si with length l is searched
with variants Mi that are at most a Hamming distance of d from M which is
called the planted motif and Mi are called the planted variants of M.
2. Edited Motif Search (EMS): We are again given a set of n strings S = {S1 , . . . , Sn }
over an alphabet , and integers l, d, and q. Our aim is to search for motifs of
length l that have at most an edit distance of d between them and are incident in
at least q of the sequences S1 , . . . , Sn .
3. Simple Motif Search (SMS): In this problem, a simple motif consists of symbols
(residues) and ?’s which mean wild characters. The alphabet for DNA in this case
is {A, C, G, T, ?}. For example, GCC??T?A?A is a pattern of length 10 with
four wild characters. The “?” symbol can be replaced by any character from the
original alphabet. A motif cannot begin or end with a wild character. In SMS, we
are asked to find all simple motifs of maximum length l and the number of their
occurrences in an input set of sequences. The (u, v)-class in SMS is defined as
the class of simple motifs with length u having exactly v wild characters.
In an orthogonal setting, the motif finding algorithms are classified as pattern-
based algorithms and profile-based algorithms. In pattern-based algorithms, the aim
is to discover the motif subsequence whereas the positions of the motif are identified
in the latter [20]. Furthermore, based on the algorithmic techniques employed, we
can further classify the motif search algorithms as probabilistic and combinatorial.
The EMS problem has been addressed in some studies employing the pattern-
based algorithms mainly [19]. The Speller [28], Edit Distance Motif Search 1 and 2
(EDMS) [33], and Deterministic Motif Search (DMS) [26] are some of the example
studies. The Speller algorithm first constructs a generalized suffix tree for the input
set of sequences. It then searches this tree for subsequences of length l which have a
maximum edit distance d from each other and if there are at least q of such sequences,
they are reported as motifs.
The SMS problem is again studied in the context of pattern-based methods. In the
SMS algorithm presented by Rajasekaran et al. to find simple motifs [26], a (u, v)-
class is defined as a group of motifs where each motif has a length of u and has exactly
v wild characters. For example, ACC??TC?G is (9, 3) class. The algorithm performs
(u − 2v) sorts to find the patterns in a (u, v) class. More specifically, it consists of
the following steps. First, all subsequences of length less than l of the sequence S
are formed for each (u, v) class. Then, these subsequences are sorted considering
the wildcard positions only. The last step of the algorithm involves scanning through
the sorted list and counting the number of occurrences of each pattern. The time
complexity of the algorithms is reported as O((u − 2v)Mu/w) for a (u, v)-class
where M is the number of residues and w is the word length of the computer.
168 8 Sequence Repeats
Hamming distances to M
S1: A C C G A A C T T A G 5
S2: A G C G C T G T G A C 3
S3: C C C T A T G A G A A 4
S4: A T C G A T G C G T C 2
M : A C C G A T G T G T C
starting position of the motif is searched or the pattern-based algorithms which pre-
dict the motif sequence. The PMS series of algorithms are exact and use exhaustive
enumeration, whereas MEME and Gibbs samplings are probabilistic methods as we
will investigate.
Two probabilistic methods called MEME and Gibbs sampling are more frequently
used than others as described below.
8.3.1.1 MEME
Multiple expectation maximization for motif elicitation (MEME) is a software tool
for motif discovery in a set of DNA or protein sequences [2,3]. It uses the expec-
tation maximization (EM) algorithm to discover motifs in a number of sequences
by representing motifs as position weighted matrices (PWMs). The EM algorithm
is used in various applications to find the maximum-likelihood estimate of the pa-
rameters of a dataset distribution in general. The MEME algorithm does not need
the distribution of motifs in the input set of sequences before running the algorithm.
MEME constructs statistical models of motifs in which each model is represented
by a discrete probability distribution matrix M. An entry m i j of M shows the prob-
ability that character i will be present in the jth position of the motif. A high-level
pseudocode of MEME is shown in Algorithm 8.2 as adapted from [10].
Algorithm 8.2 M E M E
1: Input : Set of sequences S = S1 , ..., Sn
2: Output : Suffix tree T (S) representing S
3: for pass = 1 to n m do Loop 1
4: for w = wmin to wmax do Loop 2
5: for sp = 1 to n sp do Loop 3
6: estimate score of the model with the initial
7: end for end Loop 3
8: select the initial model with the maximum score
9: for λ = λmin to λmax do Loop 4
10: run EM for convergence
11: end for end Loop 4
12: select model with maximal G(.)
13: end for end Loop 2
14: select the model with maximal G(.) and report
15: delete the model from the previous probabilities
16: end for end Loop 1
170 8 Sequence Repeats
In the external for loop, multiple motifs are searched in the input set S1 , . . . , Sn .
All possible motif widths are searched between the lines 4 and 11 by generating a
model for each width. Then, a heuristic criterion function G(.) which is based on
maximum-likelihood ratio test is used to choose the best model. The search at a
specific motif width consists of two steps: in the first step between lines 5 and 7 of
the algorithm, the score of a model is estimated; and the expectation maximization
is executed until a convergence is reached in the second step between lines 9 and 11
after the selection of the model with the best score. The parameter λ is proportional
to the number of incidences of the motif in the sequences. After a motif is discovered,
the model is erased from the prior probabilities to prevent finding the same motif
again.
Parallel MEME
The randomness can be increased by selecting the new xi values randomly ac-
cording to the probability distribution of P in Si in line 8 of the algorithm. The Gibbs
sampling algorithm excludes sequence Si at each iteration and attempts to optimize
the motif location in Si using the profile values from all other sequences. In practice,
it often finds motifs in a set of sequences.
These methods typically provide exact solutions by exhaustive enumeration but they
may also employ heuristics to find approximate solutions. We will describe a basic
exact algorithm first, followed by graph-based algorithms which represent the simi-
larity among sequences by the edges of a graph and then attempt to discover dense
regions in this graph using heuristics.
WINNOWER
Pevzner and Sze proposed a graph-based motif detection algorithm for the Planted(l, d)
motif search problem in a set of sequences S = {S1 , . . . , Sn } called WINNOVER
[25]. The similarity graph G(V, E) is first constructed with nodes representing all
l-mer subsequences of the sequences. Two nodes u and v of G are connected if the
Hamming distance between them is at most d and they are from two different se-
2
quences. The expected distance between two motifs in fact is 2d − 4d3l . The graph
G built in this manner is n-partite as there will not be any edges between l-mers of
the same sequence. Since we are looking for a motif in all of the n sequences, such
a motif will be represented by a clique of size n in G. Let us illustrate this idea by a
simple example with three DNA segment inputs S1 , S2 , and S3 as shown below:
S1 = A C T T T C
S2 = T A C T T C
S3 = G C A G T T
8.3 Sequence Motifs 173
Fig.8.2 The 3-partite graph representing the three DNA segment 4-mers. The motifs ACTT, ACTT,
and AGTT have a maximum of 2 distance between them and form a clique shown in bold. The
consensus sequence for this motif is ACTT
We have n = 3 and m = 6 in this case and let us assume that we are searching for
motifs of length 4 and hence l = 4. We form consecutive 4-mers of these sequences
and construct a graph with 4-mers as nodes and connect the nodes from different
sequences that have a maximum hamming distance of 2 between them for d = 1. We
are therefore searching for the solution of Planted(4, 1) problem in these sequences.
The graph with these values is shown in Fig. 8.2 where the clique shown in bold
represents the motif.
Finding cliques in a graph is NP-hard, and a heuristic is used in WINNOWER
to find cliques. The authors, observing G is n-partite and almost random, proposed
the concept of extendable clique. A clique is extendable if there are neighbors of
this clique in each partition of the n-partite graph such that if C1 = {V1 , . . . , Vk } is
a clique, the neighbor u of C1 ensures that C1 = {V1 , . . . , Vk , u} is also a clique.
Based
on the observation that every edge in a maximal n-clique is part of a minimum
of n−2
k−2 extendable k-cliques, spurious edges are eliminated in each iteration of the
algorithm to yield cliques. The time complexity of this algorithm is O(N 2d+1 ) where
N = nm.
The SP-STAR algorithm [25] proposed by the same authors is also a graph-
based motif search algorithm that finds cliques, but different than WINNOWER, it
associates weights with edges and l-mers and can eliminate more spurious edges
than WINNOWER at each iteration. This algorithm is faster than WINNOWER and
uses less memory.
PRUNER is another graph-based algorithm for motif detection [30]. It also con-
structs a graph G based on pairwise similarity as WINNOWER and eliminates the
vertices that cannot be part of a clique. However, the approach in the second step
of the algorithm is different, in which it groups the edges of G as l-grams that are
174 8 Sequence Repeats
different in more than d positions into Group1 and less than or equal to d posi-
tions into Group2. The edges in these groups are removed iteratively to discover
the cliques. This algorithm has two versions as PRUNER-I and PRUNER-II; the
time and space complexities of the first one are reported as O(n 2 t 2 l d/2 ||d/2 ) and
O(ntl d/2 ||d/2 ), respectively, where d is the maximum number of allowed mis-
matches, t is the number of sequences, and l is the length of the motif. The second
algorithm has O(n 3 t 3l d/2 ||d/2 ) and O(l d/2 ||d/2 ) time and space complexities.
Structured motifs inference, localization and evaluation (SMILE) uses suffix trees
to discover structured motifs [14]. It constructs a generalized suffix tree T for the
set of sequences S first and traverses T to discover motifs. Its time complexity is
an exponential function of the number of gaps. Mismatch tree algorithm (MITRA)
uses efficient depth-first-search tree traversal to find the leaves which correspond
to motifs [6]. Its time complexity is O(l d+1 ||n N ) and space complexity O(n Nl)
where n is the number of input strings and N is the size of the strings with d denoting
maximum allowed number of mismatches for l-size motifs.
We have already described how the probabilistic MEME algorithm can be paral-
lelized using the SPMD model. We will now first describe a simple generic distributed
algorithm that can be used for any motif search method followed by a distributed
graph-based motif search algorithm proposal and then review some of the recent
parallel/distributed methods.
s3 s9
implement the WINNOVER algorithm for the nodes in its partition as shown in
Algorithm 8.6. The detected cliques with this size are the motifs.
Let us illustrate the operation of this algorithm using a simple example. A graph is
first built from the 3-mers of four input sequences S1 , S2 , S3 , and S4 where an edge in
the graph depicts a maximum distance of 2d between the nodes it connects as shown
in Fig. 8.3. The 3-mers from these sequences are s1 , . . . , s12 with s1 , s2 , s3 ∈ S1 , and
others similarly from S2 , S3 , and S4 in sequence.
Let us assume that we have four processes p0 , . . . , p3 . Each process pi with the
rows it is responsible are shown below.
Each process now checks the nodes it has with the nodes in other processes. It then
forms a vector for each row that has an element with entries shown as below. These
elements represent the nodes that have at least one connection to an l-mer of each
partition of the k-partite graph and are broadcast with their connected neighbors.
Each process then performs the WINNOVER algorithm for the nodes in its parti-
tion using the global connection data. The final cliques in this graph represented by
the nodes s1 , s5 , s8 , s12 and s2 , s6 , s8 , s10 are shown in bold. The node s8 is a member
of both cliques.
178 8 Sequence Repeats
The distributed algorithms for motif search are scarce; however, there has been few
research efforts in this direction recently. One such study is ACME by Sahli et al. [29]
which uses parallelization of suffix trees. It supports supermaximal motifs which
are motifs not included in any others. The model adapted is the supervisor–worker
model with thousands of worker processes. The claimed novelty of this algorithm
is the traversal order of the search space and the order of accessing data in the
suffix tree. The supervisor inputs the sequence data from the disk, decomposes the
search space, and generates tasks. Each worker receiving the input sequence builds
the annotated suffix tree and requests a task from the supervisor. The search space
is horizontally partitioned into a large number of sub-tries for each process. The
algorithm is implemented in MPI and also using multi-threading, and experimented
with the human genome of 2.6 GB size and protein sequence of 6 GB. It was assessed
as scalable based on the obtained results.
Nicolae and Rajasekaran presented a parallel exact algorithm for planted motif
search called PMS8 which can work on large values of l and d [21]. Exact PMS
algorithms frequently employ finding the common neighbors of three l-mers. The
PMS8 has two main steps as sample-driven and pattern-driven steps. In the first step,
tuples of l-mers from different sequences are generated, and in the pattern-driven
step of the algorithm, common d-neighborhoods of these tuples are formed. The
parallel implementation considers m − l + 1 independent subtasks. Equal number
of such subtasks can be assigned to each process. The only interprocess communi-
cation needed is the broadcasting of the input to processors. In order to manage load
balancing efficiently, process 0 is assigned as the scheduler and the other processes
as the workers. The scheduler inputs the sequences, broadcasts it to the workers,
and waits for requests from workers. Workers continue requesting subtasks from the
scheduler and solving it until no more subtasks remain. They send the motifs found
to the scheduler which outputs them in the end. This method was tested in a cluster
of 64 nodes using OpenMPI and was found to be scalable.
The exact algorithm PMSP which is based on PMS1 has been parallelized recently
in [18]. A bit vector mapping method which allows detection of large motif lengths
with greater Hamming distances is used in this study. The tasks which find the
motifs are distributed evenly to the cluster nodes, and both coarse gain and fine
grain parallelism methods are employed. The authors report that the evaluation on
two-node SMP system shows that the algorithm is scalable.
The simple motif search (SMS) problem can also be parallelized. There are var-
ious approaches for this task; in a simple and basic approach, the sequences can be
partitioned to a number of processes and the results found can be merged. Alterna-
tively, the sorting operations needed can be considered as tasks and each process can
handle one task under the arbitration of a central scheduling process which provides
load balancing.
A very recent study by Ikebata and Yoshida is called repulsive parallel Markov
Chain Monte Carlo (RPMCMC) algorithm and aims to find motifs by parallelizing
Gibbs sampling method in a distributed memory system [11]. The authors criticize the
8.3 Sequence Motifs 179
inability of the classical Gibbs sampling method to discover latent diverse motifs even
for a small number of inputs. This is claimed to be the result of posterior distribution
containing many locally high probability regions. The presented algorithm runs the
Gibbs motifs sampler in parallel by all-at-once sampling which discovers the diverse
motifs as shown by the experiments.
Repeating subsequences occur both in DNA nucleotide and protein amino acid se-
quences. These patterns are speculated to have some biological significance. Some
of the repeats in DNA indicate transcription factor binding sites and therefore discov-
ering them helps to find the location of genes. An increased number of such repeats
in DNA may signal existence of some diseases and finding them is helpful to deter-
mine the disease state of an organism. In forensics, discovery of repeats is frequently
used to identify an individual. Last but not least, search of repeats in a number of
biological sequences is used to compare them and find conserved regions in the or-
ganisms represented by these sequences which may be used to infer phylogenetic
relationships along them.
A tandem repeat consists of repeating patterns, whereas a sequence motif is dis-
persed in various locations in DNA or protein. The search of these repeats may be
viewed from several different angles. First, the pattern may be known before the
search or not, and in the former case, the problem is reduced to pattern matching,
whereas the latter is the general case. We may be looking for approximate repeats
both in tandem repeats or motifs, rather than exact ones, to allow mutations and this is
what is commonly encountered in reality. We may want to discover the pattern itself
only as in pattern-based search, or its locations in the sequence as in profile-based
search or both. Our aim may be to extract repeats in a single sequence or to find
repeats in a number of sequences to find their similarities. Yet, another distinction is
based on the method used, and we can employ an exact algorithm which finds every
occurrence of a repeat or a heuristic algorithm that only provides an approximate
solution.
Statistical algorithms provide approximate solutions and combinatorial algo-
rithms present exact results in general, but combinatorial methods may also employ
heuristic approaches to result in approximate results. We reviewed few sequential
representative algorithms: an exact combinatorial algorithm, suffix tree-based tan-
dem repeat, and motif extraction algorithms, and the probabilistic algorithms MEME
and Gibbs sampler. We then investigated ways of parallelizing repeat search process
on distributed memory processors. A simple method is based on the supervisor–
worker model of parallel processing and relies on partitioning dataset of sequences
to a number of processes, and letting each worker process search for pattern in its
subset. In another approach, we discussed how the suffix trees can be constructed
and analyzed for repeating patterns in parallel. Using graphs to detect repeats opens
another door for parallelism as bringing the problem to this domain provides some
180 8 Sequence Repeats
already existing parallel methods such as searching for cliques in parallel in a graph
to our immediate use. All l-mers of a sequence are formed as a partition of a k-partite
graph for k sequences and edges are connected between close l-mers. A clique in such
a k-partite graph represents a motif. We proposed a distributed algorithm that finds
the cliques in a k-partite graph in k rounds. The repeat search in DNA sequences and
also in proteins will continue to be a fundamental area of research in bioinformatics
due to the imperative information gained about molecular biology by the analysis of
these repeats.
Exercises
1. Implement Stoye and Gusfield algorithm to find the tandem repeats in the follow-
ing DNA sequence. Draw the suffix tree for the sequence and mark the locations
of the repeats in this tree.
A A T T A T T A T T C C G A
2. Find the Planted(3, 1) motif in the following DNA sequence using the naive
algorithm by checking every subsequence.
T G G A A G G C T G G A
3. Work out the consensus sequence of the following protein sequences. Work out
also the distance of each sequence to the consensus sequence. What should be
the value of the maximum distance d for these four sequences to be considered
as a planted(11, d) motif?
S1: R C Q E P D X F D H N
S2: H C B M I V E Q D H M
S3: R C B P L V X S D H M
S4: A W K M P V X Q Z H M
4. Draw the similarity graph for the following DNA sequences for distance d = 1
and 4-mers. Check this graph visually for the existence clique of size 4 and if it
exists, test whether the vertices of this clique are Planted(7, 1) motifs.
S1: C C C G A G C
S2: G G A G C T G
S3: C C C T A T G
S4: T A G C A T G
5. Sketch the high-level pseudocode for the parallel MEME algorithm as described
in Sect. 8.3.1.
8.4 Chapter Notes 181
S1: G A C G A A C G T
S2: A G A C C T G A C
S3: C G A T A T G C A
S4: A T C C A C G C T
References
1. Abouelhoda MI, Kurtz S, Ohlebusch E (2002) The enhanced suffix array and its applications
to genome analysis. In: Proceedings of WABI 2002, LNCS, vol 2452, pp 449–463. Springer
2. Bailey TL, Elkan C (1995b) The value of prior knowledge in discovering motifs with MEME. In:
Proceedings of thethird international conference on intelligent systems for molecular biology,
pp 21–29. AAAI Press
3. Bailey TL, Elkan C (1995a) Unsupervised leaning of multiple motifs in biopolymers using
EM. Mach Learn 21:51–80
4. Chun-Hsi H, Sanguthevar R (2003) Parallel pattern identification in biological sequences on
clusters. IEEE Trans Nanobiosci 2(1):29–34
5. Crochemore M (1981) An optimal algorithm for computing the repetitions in a word. Inf
Process Lett 12(5):244–250
6. Eskin E, Pevzner PA (2002) Finding composite regulatory patterns in DNA sequences. Bioin-
formatics 18:354–363
7. Floratos A, Rigoutsos I (1998) On the time complexity of the TEIRESIAS algorithm. In:
Research report RC 21161 (94582), IBM T.J. Watson Research Center
8. Gatchel JR, Zoghbi HY (2005) Diseases of unstable repeat expansion: mechanisms and com-
mon principles. Nat Rev Genet 6:743–755
9. Goldstein DB, Schlotterer C (1999) Microsatellites: evolution and applications, 1st edn. Oxford
University Press, ISBN-10: 0198504071, ISBN-13: 978-0198504078
10. Grundy WN, Bailey TL, Elkan CP (1996) ParaMEME: a parallel implementation and a web
interface for a DNA and protein motif discovery tool. Comput Appl Biosci 12(4):303–310
11. Ikebata H, Yoshida R (2015) Repulsive parallel MCMC algorithm for discovering diverse
motifs from large sequence sets. Bioinformatics 1–8: doi:10.1093/bioinformatics/btv017
12. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC (1993) Detecting
subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262:208–
214
13. Lim KG, Kwoh CK, Hsu LY, Wirawan A (2012) Review of tandem repeat search tools: a
systematic approach to evaluating algorithmic performance. Brief Bioinform. doi:10.1093/
bib/bbs023
182 8 Sequence Repeats
14. Marsan L, Sagot MF (2000) Algorithms for extracting structured motifs using a suffix tree with
application to promoter and regulatory site consensus identification. J Comput Biol 7(3/4):345–
360
15. Matroud A (2013) Nested tandem repeat computation and analysis. Ph.D. Thesis, Massey
University
16. Mejia YP, Olmos I, Gonzalez JA (2010) Structured motifs identification in DNA sequences.
In: Proceedings of the twenty-third international florida artificial intelligence research society
conference (FLAIRS 2010), pp 44–49
17. Modan K, Das MK, Dai H-K (2007) A survey of DNA motif finding algorithms. BMC Bioin-
form 8(Suppl 7):S21
18. Mohantyr S, Sahu B, Acharya AK (2013) Parallel implementation of exact algorithm for planted
(l,d) motif search. In: Proceedings of the international conference on advances in computer
science, AETACS
19. Mourad E, Albert YZ (eds) (2011) Algorithms in computational molecular biology: techniques,
approaches and applications. Wiley series in bioinformatics, Chap. 18
20. Mourad E, Albert YZ (eds) (2011) Algorithms in computational molecular biology: techniques,
approaches and applications, pp 386–387. Wiley
21. Nicolae M, Rajasekaran S (2014) Efficient sequential and parallel algorithms for planted motif
search. BMC Bioinform 15:34. doi:10.1186/1471-2105-15-34
22. Pardalos PM, Rappe J, Resende MGC (1998) An exact parallel algorithm for the maximum
clique problem. In: De Leone et al (eds) High performance algorithms and software in nonlinear
optimization, vol 24. Kluwer, Dordrecht, pp 279–300
23. Parson W, Kirchebner R, Muhlmann R, Renner K, Kofler A, Schmidt S, Kofler R (2005)
Cancer cell line identification by short tandem repeat profiling: power and limitations. FASEB
J 19(3):434–436
24. Pelotti S, Ceccardi S, Alu M, Lugaresi F, Trane R, Falconi M, Bini C, Cicognani A (2008)
Cancerous tissues in forensic genetic analysis. Genet Test 11(4):397–400
25. Pevzner P, Sze S (2000) Combinatorial approaches to finding subtle signals in DNA sequences.
In: Proceedings of the eighth international conference on intelligent systems on molecular
biology. San Diego, CA, pp 269–278
26. Rajasekaran S, Balla S, Huang C-H, Thapar V, Gryk M, Maciejewski M, Schiller M (2005)
High-performance exact algorithms for motif search. J Clin Monitor Comput 19:319–328
27. Rajasekaran S, Balla S, Huang C-H (2005) Exact algorithms for planted motif problems. J
Comput Biol 12(8):1117–1128
28. Sagot MF (1998) Spelling approximate repeated or common motifs using a suffix tree. In:
Proceedings of the theoretical informatics conference (Latin98), pp 111–127
29. Sahli M, Mansour E, Kalnis P (2014) ACME: Efficient parallel motif extraction from very long
sequences. VLDB J 23:871–893
30. Satya RV, Mukherjee A (2004) New algorithms for Finding monad patterns in DNA sequences.
In: Proceedings of SPIRE 2004, LNCS, vol 3246, pp 273–285. Springer
31. Srinivas A (ed) (2005) Handbook of computational molecular biology. Computer and infor-
mation science series, Chap. 5, December 21, 2005. Chapman & Hall/CRC, Boca Raton
32. Stoye J, Gusfield D (2002) Simple and flexible detection of contiguous repeats using a suffix
tree. Theor Comput Sci 1–2:843–856
33. Thota S, Balla S, Rajasekaran S (2007) Algorithms for motif discovery based on edit distance.
In: Technical report, BECAT/CSE-TR-07-3
34. Zambelli F, Pesole G, Pavesi G (2012) Motif discovery and transcription factor binding sites
before and after the next-generation sequencing era. Brief Bioinform 14:225–237
35. Zhang S, Li S, Niu M, Pham PT, Su Z (2011) MotifClick: prediction of cis-regulatory binding
sites via merging cliques. BMC Bioinf 12:238
Genome Analysis
9
9.1 Introduction
Basic sequence analysis we have analyzed until now involved pattern matching, com-
paring two or more sequences by alignment, clustering of sequences, and searching
for repeats. Our aim in this chapter is to analyze genome and proteins at a higher,
more macroscopic level of subsequences rather than at nucleotide and amino acid
level. We will investigate the genome at three coarser and more observable levels:
the genes, genome rearrangements, and haplotyping.
A gene is basically a subsequence of DNA which encodes for a protein. Transfer
of genetic information from a gene to a protein is called the gene expression. Genes
are the coding parts of DNA and constitute about 3 % of DNA. Gene finding or
gene prediction is the process of discovering the location and contents of genes in a
DNA strand where the latter term in the general sense implies that some statistical
processing is involved during search. Finding genes helps to understand their function
better and we will see there are a number of algorithms for this purpose.
Point mutations in which a nucleotide changes by substitution or indels are not
the only evolutionary changes in DNA. Various changes at subsequence level, such
as reversal or exchange of subsequences, are some of the dynamic events which may
cause generation of new organisms and also may be one of the reasons behind various
diseases. This process in which parts of a DNA are modified at subsequence level
is termed genome rearrangement and discovering these macroscopic mutations is
one of the fundamental research areas in bioinformatics as it gives us insight into the
evolutionary process in a larger time frame and understanding of the disease states.
A haplotype is part of a single chromosome of a genome. The biotechnical methods
usually produce data from both chromosomes of an organism for practical and cost-
effective reasons, however, the sequence information from a haplotype only is needed
as it is meaningful in the study of genetic information for a number of applications
including disease analysis. Obtaining data of a haplotype from genetic data is called
haplotype inference and there are many methods to accomplish this task. We start
this chapter by first describing fundamental sequential and distributed methods for
© Springer International Publishing Switzerland 2015 183
K. Erciyes, Distributed and Sequential Algorithms for Bioinformatics,
Computational Biology 23, DOI 10.1007/978-3-319-24966-7_9
184 9 Genome Analysis
Genes are the basic units of hereditary information distributed over the chromosomes.
There are over twenty thousand genes in human genome and functioning of about
50 % of them is not fully understood. In humans, chromosome 1 is the largest and
contains the highest number of genes in excess of 3,000 whereas the Y chromosome
of males is the smallest and contains only about few hundred genes.
Genes are placed between long sequences of DNA that are believed to be nonfunc-
tional. The genome of an organism is the whole DNA including functional (genes)
and nonfunctional subsequences. The size of the human genome is about three billion
base pairs (bps) of nucleotides. A complementary copy of the nucleotide sequence
in a gene is made in the transcription process. The nonfunctional parts of the tran-
scribed DNA sequence is thrown out in the splicing process as commonly observed
in eukaryotes, to result in the messenger RNA (mRNA) sequence which is forwarded
to the protein synthesis process. Each codon of the mRNA is translated into an amino
acid sequence which makes up a protein in the final translation process. These con-
secutive steps of transcription, splicing, and translation are termed as the central
dogma of life.
The structure of a gene is depicted in Fig. 9.1. The functional, coding regions in
a gene called exons are placed between the nonfunctioning sequences called introns
both of which have varying lengths, introns commonly being much longer in length
than exons in humans. The splicing process involves excising out the introns from
the transcribed sequence and the exons are combined to form the mRNA. The mRNA
coding region always starts with the same codon of ATG to generate the amino acid
methionine and ends with one of the three stop codons of TAA, TAG, or TGA which
are not translated into amino acids. There are 20 amino acids and 64 codons which
means few codons correspond to the same amino acid as shown in the genetic code.
The amino acid sequence that is translated is called a polypeptide and a protein
consists of one or more polypeptides. The gene sequence between the start codon
and the stop codon of a gene is called the open reading frame (ORF).
As a gene has start and stop codons specifying its beginning and ending locations,
gene finding can be performed by simply searching for the start codon ATG over the
genome and once this is found, we can search for one of the stop codons to locate
the whole gene. In this case, the problem is reduced to that of pattern matching,
however, this time consuming approach is hardly used on its own due to the huge
size of genome. There are various problems associated with gene finding. First,
measurements may be erronous resulting in false signals. Genes may overlap, that
is, a part of a gene may be a part of another gene. Moreover, a gene may be nested
in another gene. There are occasions where several genes may encode for a single
9.2 Gene Finding 185
Fig. 9.1 The structure of a gene. Splicing forms the mRNA which is used for translation process
Gene finding in prokaryotes is relatively easier as they have smaller genomes and
their genes occupy a much higher percentage of the genome than eukaryotes. On
the contrary, finding genes in eukaryotes is more difficult as genes are more sparsely
distributed, separated by long noncoding DNA segments. Moreover, the gene in
an eukaryote consists of exons and introns and therefore the coding region is not
contiguous as we have seen. Also, detecting the gene signals such as promoters are
more difficult due to their more complex structure than in prokaryotes. Our main
emphasis in this section is gene finding in eukaryotes.
Algorithms to find genes can be broadly classified as ab initio (“from the begin-
ning” in Latin) methods and comparison-based methods. Ab initio methods typically
employ statistical algorithms that search for certain signals associated with genes.
Two such commonly used sequence information are the signal sensors and content
sensors. Signal sensors are the sequence repeating patterns such as the ones found in
transcription binding sites; signals related to splice sites, and start and stop codons.
We have already seen methods to detect sequence repeats in Chap. 8 and such algo-
rithms can be used to find the repeats in transcription binding sites to detect genes.
186 9 Genome Analysis
The first exon of a gene starts with the codon ATG and ends with one of the three stop
codons: TAA, TAG, or TGA. Moreover, the boundary between an exon and an intron
of a gene is always signaled by the sequence AG. Using this information, it is pos-
sible to statistically determine the frequent occurrences of these signals and deduce
genes. Content sensing is statistically determining the coding sequences such as the
usage of codons to determine exon sequences. There are several ab initio methods
to find genes using these concepts such as dynamic programming, Hidden Markov
Models (HMMs), and neural networks (NNs).
Comparison-based gene finding is based on finding similarities between the input
genome and proteins or other genomes. The general idea here is that the coding parts
of a genome are more conserved through the evolution than the other noncoding
parts. Therefore, the similar parts of two or more genomes or proteins may indicate
shared genes between two or more species. Two major approaches for comparison
are either between the amino acid sequence that may be produced by a genome and
some known protein sequences; or between the query and a number of known genome
sequences. To find similarity of sequences, local or global alignment algorithms we
have seen in Chap. 6 can be used. BLAST family of algorithms can be adopted for
this purpose [1], and as another example, the GeneWise program performs global
alignment of the translated genomic sequence of the query to a homologous protein
sequence [8].
A Markov chain is a state machine where the future state is only dependent on the
current state and the probability of moving from the current state to the future state.
The sum of the probabilities of transitions from a state is equal to unity. Formally, a
Markov chain is a triplet (Q, p, A) such that
• Q is a finite set of states from an alphabet .
• p is the initial state probabilities.
• A is the state transition probabilities with elements ast for transition from state s
to t such that ast = P(xi = t|xi−1 = s).
The probability of each variable xi is dependent only on the value of the preceding
variable xi−1 . The probability of observing a given sequence of events in a Markov
chain is the product of all observed transition probabilities as shown below.
L
Pr(x1 ) = Pr(xi |xi−1 ) (9.1)
i=2
A Hidden Markov Model (HMM) is a nondeterministic state machine that emits
variable-length sequences of discrete symbols. It is a 5-tuple (Q, V, p, A, E) with Q,
p, and A as defined above and additionally, V and E as below:
9.2 Gene Finding 187
Nature inspired methods mimic the processes in nature, in this case, the biochemical
processes in organisms.These methods are considered as ab initio procedures as they
predict the location of a gene with some probability. Two such methods are the use
of artificial neural networks (ANNs) and genetic algorithms (GAs) described next.
188 9 Genome Analysis
. .
. .
. .
. . . .
.
.
9.2 Gene Finding 189
Gene recognition and analysis internet link-I (GRAIL-I) is an early study employ-
ing ANNs for gene prediction. It uses a multilayer feedforward ANN and in the later
improved version called GRAIL-II, the problem of missing short exons in GRAIL-I
was solved using variable-length windows. Gene identification using neural nets and
homology information (GIN) [12] and GENSCAN [11] are the tools that also utilize
ANNs for gene prediction. More recent work using ANNs to detect genes are in
[39,42]. A detailed survey of these methods can be found in [23].
Cabbage 1 -5 4 -3 2
1 -5 4 -3 -2 (inversion of 2)
1 -5 -4 -3 -2 (inversion of 4)
Turnip 1 2 3 4 5 (reversal and inversion of -5,...,-2)
As we have seen in this example, sorting by reversals may result in forming new
species and is more frequently observed in unichromosomal genome rearrangements
than all other rearrangements in a single chromosome.
For example, with π = 1 2 | 5 4 3 | 6 7, there are two breakpoints at (2, 5) and (3,
6) locations. Approximation algorithms to find the minimum number of reversals to
convert one permutation to another generally rely on removing breakpoints from a
permutation to sort it. Every reversal removes at most two breakpoints, and hence:
Definition 9.3 (strip) A strip is an interval [i, j] where (i − 1, i) and (j, j + 1) are
breakpoints with no other breakpoints between them. In other words, a strip is the
maximal increasing or decreasing subsequence that does not contain any breakpoints.
1. for i = 1 to n − 1
2. if πi = i then
3. do a reversal to put i in ith place in π
Given a permutation π = 3 4 1 5 2, the steps of this naive algorithm would be as
below.
3 4 1 5 2
1 4 3 5 2
1 2 5 3 4
1 2 3 5 4
1 2 3 4 5
k among all decreasing strips of π is selected and the strip between k and the element
k − 1 of π is reversed.
0 3 4 6 5 1 2 7 k = 5, k-1 = 4
0 3 4 5 6 1 2 7 reverse an increasing strip
0 6 5 4 3 1 2 7 k = 3, k-1 = 2
0 6 5 4 3 2 1 7 k = 1, k-1 = 0
0 1 2 3 4 5 6 7
shows that as long as we have a decreasing strip in π , we can reduce the number
of breakpoints by performing a reversal on this strip. When there are no decreasing
strips, we can no longer reduce the number of breakpoints. In the 2-approximation
algorithm provided by Kececioglu and Sankoff for unsigned reversals using break-
points [33], the reversal that removes the highest number of breakpoints is chosen in
each iteration until the identity permutation is obtained as shown in Algorithm 9.2
[33].
Choice of such reversals is not trivial and we will show the detailed operation of this
algorithm as in [45]. Let us assume that the permutation π contains decreasing strips
with Sm being the string with the maximum element of all elements in decreasing
strips, and Sk as the minimum element containing strip as before. Let Sk−1 be the
strip containing the element k–1 and Sm−1 containing m+1. Let also ρk be the reversal
that merges Sk and Sk−1 ; and ρm be the reversal that merges Sm and Sm−1 . If neither
ρm × π nor ρk × π contains any decreasing strip, then the reversal ρk = ρm removes
two breakpoints. This test is done in line 6 of Algorithm 9.2 shown in detail below.
In the first two cases, number of breakpoints is reduced by at least one and the last
case reduces it by two.
1. if π × ρk contains a decreasing strip then
2. π ← π × ρk
3. else if π × ρm contains a decreasing strip then
4. π ← π × ρk
5. else
6. π ← ×π(ρk = ρm )
7. reverse any increasing strip in π to get a decreasing strip
If a step of the algorithm forms a π without a decreasing strip, this step has
reduced b(π ) by two. This algorithm will provide a two decrease in the number of
196 9 Genome Analysis
breakpoints for every two reversals. Therefore, there will be at most b(π ) reversals
using this algorithm and since the optimum number of reversals is at least b(π )/2,
the approximation ratio is b(π )/b(π )/2 = 2.
In signed permutations, each element of the genome can have a positive or a negative
sign as described to reflect the orientation in the order of genome elements such as
genes. Our problem again is to sort a signed permutation with minimum number
of traversals. A 1.5-approximation algorithm to find signed reversal distance was
provided by Bafna and Pevzner [4] using breakpoint graphs. Hannenhalli and Pevzner
showed in 1995 that this problem can be solved exactly in polynomial time with O(n4 )
time complexity [28]. The running time of the polynomial algorithm was improved
to O(n2 ) by Kaplan et al. [31] and the computation time of the reversal distance
was obtained as O(n) by Bader et al. [3]. The polynomial time algorithms used the
concept of the breakpoint graph as described next.
(a)
0 3 1 6 5 4 2 7
(b)
0 1 2 3 4 5 6 7
Fig. 9.4 The breakpoint graphs for the permutation π = 0 3 1 6 5 4 2 7. These two representations
are equivalent
and therefore to the solution. The reversal distance problem is hence reduced to
maximal cycle decomposition problem. The reader is referred to [4,5] for a detailed
analysis of this algorithm. A modified breakpoint graph can also be used for sorting
by transpositions [15].
( 0 2 4 -3 6 5 -1 7)
oriented pairs
0 1 2 3 4 5 6 7 identity permutation
0 1 2 3 4 5 6 7 identity permutation
The difficult part of this algorithm is the determination of which oriented pair to
apply the reversal with. We need to check each pair and count the pairs in the resulting
permutation had we applied this reversal. We then need to find the one that gives the
maximum score. For sequences with many genes, this would be time consuming.
Bergeron showed that if applying Algorithm 9.3 to a permutation π using k reversals
yields a permutation π , then d(π ) = d(π )+k. A second algorithm based on hurdles
was also presented in [6]. A framed interval of a permutation π is defined as follows:
where all values between i and i + k are in the interval [i...i + k]. For example,
in the permutation π = 2 4 3 1 5 7 6; [ 2 4 3 1 5 ], [ 3 1 5 7 ], and [ 1 5 7 ] are
examples of framed intervals. A hurdle is defined as a framed interval that does not
contain any shorter framed intervals. The last sequence in our example is a hurdle.
Hurdle cutting is the process of reversing the segment inside the hurdle and hurdle
merging is defined as reversing the segment between the end point of a hurdle and
the beginning point of another hurdle. Furthermore, a simple hurdle is defined as a
hurdle, cutting of which decreases the number of hurdles. Using these concepts, a
second algorithm defined in [6] is as follows. This algorithm together with the first
one can be used to sort signed permutations.
1. if a permutation π contains 2k hurdles (k ≥ 2) then
2. merge any two nonadjacent hurdles
3. else if π has 2k+1 hurdles (k ≥ 1) then
4. if it has one simple hurdle then
5. cut the hurdle
6. else if it has no hurdles then
7. merge two nonadjacent hurdles or adjacent ones if k = 1
process pi has finished its task in the current round before the next round starts, as
in line 23 of the algorithm.
A reported work about a distributed implementation of the breakpoint graph
method is in [17]. The authors propose first the construction of the breakpoint graph
B(π ) in parallel. The optimum distance in B(π ) is also computed in parallel in the
second step. The authors state that this second step of the algorithm is the parallel
version of the algorithms in [9,44]. Further algorithms are used to find the possible
next reversals sequence and a parallel algorithm in the last step inputs the results
from the first three stages and outputs the final sorting of the signed permutation.
There are no implementation details and no analysis information given.
Our DNA consists of 23 pairs of chromosomes with each pair consisting of a chro-
mosome inherited from father and a chromosome from mother. Humans are diploid,
meaning they have pairs of chromosomes. A locus or a site is a location in a pair
of chromosomes. The observable physical characteristics of an organism such as its
appearance and behavior is called its phenotype. For example, the color of the hair
and the blood group type of an individual belong to her phenotype.
position 1 2 3 4 5 6 7
P1 A G C C T G A
A T G G C C T
P2 A G C G A G A
G C G A T C T
The implementation steps of this algorithm for a sample genotype set is shown
below.
Clark’s algorithm found many applications when first proposed and is still used
for inferring haplotypes. It can be used for large data but requires haplotypes to be
not very diversified as this would result in prolonged processing. One immediate
problem with this algorithm is that it requires the existence of homozygote or a
heterozygote genotype with one heterozygote allele to start. Also, it may not give
unique solutions due to the order of processing.
9.4.3 EM Algorithm
applied to haplotype inference here, referring the reader to the mentioned references
for detailed analysis. The EM algorithm aims to find the maximum-likelihood esti-
mate of the parameters of a data set distribution in general. The implementation of
the EM algorithm for genotype phasing consists of the following main steps:
1. Guess the initial haplotype frequencies
2. repeat
3. Use the current guess to compute the expected number of occurrences.
4. Compute the new estimates of haplotype frequencies using the expected num-
ber of occurrences.
5. until the haplotype frequencies converge
The complexity for one step of the EM algorithm can be stated as O(nk), where n
is the number of genotype data and k is the maximum number of heterozygous loci
in the genotypes to be analyzed.
The method to be employed would again be the partitioning of the data space to
perform distributed haplotype inference. We will describe an implementation of
distributed Clark’s algorithm and show how EM algorithm can be designed to run in
parallel on a distributed computer system.
We have briefly reviewed three distinct genome analysis problems. The first problem
was finding the location and contents of genes and this is needed as the first step
of gene analysis. We saw the two main methods for this purpose as ab initio and
comparison-based approaches. Ab initio methods attempt to discover statistically
more frequently appearing subsequences in or around gene regions. HMMs are a
widely used ab initio scheme, and ANNs and GAs provide alternative techniques for
gene prediction.
Comparative methods assume that the evolutionary changes in the genes are slower
when compared to the mutations in the noncoding parts of the genome. Therefore,
comparing two similar genomes, we can detect regions of high similarity using
sequence alignment methods, and these are potentially the coding sequences of
the genome. We need to do post-processing of these similarity regions to detect
the starting and stopping locations of the gene. There are hardly any algorithms
for parallel/distributed gene finding, however, we can conveniently employ any
9.5 Chapter Notes 207
1. Discuss how and why sequence alignment methods can be used to find genes in
a number of closely related organisms.
2. Given the permutation π = 4 2 1 3 5 6, convert π to the identity permutation using
the naive algorithm. Show each step of the algorithm.
3. For the permutation π = 3 2 4 5 6 1 8 7, mark the increasing and decreasing
strips. Then, implement the 4-approximation algorithm to reduce π to the identity
permutation.
4. For the permutation π = 4 3 1 2 5 7 8 6, mark the increasing and decreasing
strips. Then, implement the 2-approximation algorithm to reduce π to the identity
permutation.
5. Draw the breakpoint graph for the permutation π = 5 4 1 2 6 8 3 7 9 by showing
the black and gray edges.
6. Sketch a distributed version of the oriented pair-based algorithm using the
supervisor–worker model with the supervisor involved in computation. Show
208 9 Genome Analysis
References
1. Altschul SF, Gish W, Miller W, Myers EW et al (1990) Basic local alignment search tool. J
Mol Biol 215(3):403–410
2. Axelson-Fisk M (2010) Comparative gene finding: models, algorithms and implementation:
Chap. 2, Computational Biology Series, Springer
3. Bader DA, Moret BME, Yan M (2001) A linear-time algorithm for computing inversion dis-
tance between signed permutations with an experimental study. In: FKHA Dehne, J-R Sack,
R Tamassia (eds) WADS, LNCS, vol 2125. Springer, pp 365–376
4. Bafna V, Pevzner PA (1993) Genome rearrangements and sorting by reversals. In: Proceedings
of the 34th annual symposium on foundations of computer science, pp 148–157
5. Bafna V, Pevzner PA (1996) Genome rearrangements and sorting by reversals. SIAM J Comput
25(2):272–289
6. Bergeron A (2005) A very elementary presentation of the Hannenhalli-Pevzner theory. Discrete
Appl Math 146(2):134–145
7. Berman P, Hannenhalli S, Karpinski M (2002) 1.375-approximation algorithm for sorting by
reversals. In: Proceedings of the 10th annual european symposium on algorithms, series ESA
02, Springer, London, UK, pp 200–210
8. Birney E, Durbin R (2000) Using GeneWise in the Drosophila annotation experiment. Genome
Res 10:547–548
9. Braga MDV, Sagot M, Scornavacca C, Tannier E (2007) The solution space of sorting by
reversals. In: Mandoiu II, Zelikovsky A (eds) ISBRA 2007, vol 4463, LNCS (LNBI)Springer,
Heidelberg, pp 293–304
10. Brunak S, Engelbrecht J, Knudsen S (1991) Prediction of human mRNA donor and acceptor
sites from the DNA sequence. J Mol Biol 220(1):49–65
11. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J
Mol Biol 268(1):78–94
12. Cai Y, Bork P (1998) Homology-based gene prediction using neural nets. Anal Biochem
265(2):269–274
13. Caprara A (1997) Sorting by reversals is difficult. In: Proceedings of the 1st ACM conference
on research in computational molecular biology (RECOMB’97), pp 75–83
14. Christie DA (1998) A 3/2-approximation algorithm for sorting by reversals. Proceedings the
ninth annual ACM-SIAM symposium on Discrete algorithms, series SODA 98. Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 244–252
15. Christie DA (1999) Genome Rearrangement Problems. Ph.D. thesis, The University of Glasgow
16. Clark AG (1990) Inference of haplotypes from PCR-amplified samples of diploid populations.
Mol Biol Evol 7:111–122
References 209
17. Das AK, Amritanjali (2011) Parallel algorithm to enumerate sorting reversals for signed per-
mutation. Int J Comp Tech Appl 2(3):579–589
18. Day RO, Lamont GB, Pachter R (2003) Protein structure prediction by applying an evolutionary
algorithm. In: Proceedings of the international parallel and distributed processing symposium
19. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the
EM algorithm. J R Stat Soc 39(1):1–38
20. Dobzhansky T, Sturtevant A (1938) Inversions in the chromosomes of drosophila pseudoob-
scura. Genetics 23:28–64
21. Duc DD, Le T-T, Vu T-N, Dinh HQ, Huan HX (2012) GA_SVM: a genetic algorithm for
improving gene regulatory activity prediction. In: IEEE RIVF international conference on
computing and communication technologies, research, innovation, and vision for the future
(RIVF)
22. Excoffier L, Slatkin M (1995) Maximum-likelihood estimation of molecular haplotype fre-
quencies in a diploid population. Mol Biol Evol 12(5):921–927
23. Goel N, Singh S, Aseri TC (2013) A review of soft computing techniques for gene prediction.
Hindawi Publishing Corporation ISRN Genomics, vol 2013, Article ID 191206. http://dx.doi.
org/10.1155/2013/191206
24. Gusfield D (2002) Haplotyping as perfect phylogeny: conceptual framework and efficient
solutions. In: Proceedings of the 6th annual international conference computational biology,
pp 166–175
25. http://genes.mit.edu/GENSCAN.html. The GENSCAN Web Server at MIT
26. http://www.fruitfly.org/seq_tools/genie.html. The Genie web server
27. http://www.genezilla.org/. The GeneZilla web server
28. Hannenhalli S, Pevzner PA (1999) Transforming cabbage into turnip: polynomial algorithm
for sorting signed permutations by reversals. J ACM 46(1):1–27
29. Hawley ME, Kidd KK (1995) HAPLO: a program using the EM algorithm to estimate the
frequencies of multi-site haplotypes. J Heredity 86(5):409411
30. Henderson H, Salzberg S, Fasman KH (1997) Finding genes in DNA with a Hidden Markov
Model. J Comput Biol 4(2):127–141
31. Kaplan H, Shamir R, Tarjan RE (2000) A faster and simpler algorithm for sorting signed
permutations by reversals. SIAM J Comput 29(3):880–892
32. Karayiannis NB, Venetsanopoulos AN (1993) Artificial neural networks, learning algorithms,
performance evaluation, and applications. Springer Science+Business Media, New York
33. Kececioglu J, Sankoff D (1993) Exact and approximation algorithms for the inversion distance
between two permutations. In: Proceedings of the 4th annual symposium on combinatorial
pattern matching, volume 684 of Lecture Notes in Computer Science, Springer, New York,
pp 87–105
34. Krogh A (1997) Two methods for improving performance of a HMM and their application for
gene finding. In: Gaasterland T, Karp P, Karplus K, Ouzounis C, Sander C, Valencia A (eds)
Proceedings of the fifth international conference on intelligent systems for molecular biology.
AAAI Press, Menlo park, CA, pp 179–186
35. Krogh A (1998) An introduction to hidden Markov models for biological sequences. In:
Salzberg SL, Searls DB, Kasif S (eds) Computational methods in molecular biology Chapter
4 . Elsevier, Amsterdam, The Netherlands, pp 45–63
36. Long JC, Williams RC, Urbanek M (1995) An E-M algorithm and testing strategy for multiple-
locus haplotypes. Am J Hum Genet 56(3):799–810
37. Mourad E, Albert YZ (eds) (2011) Algorithms in computational molecular biology: techniques,
approaches and applications. Wiley Series in Bioinformatics, Chap 33
38. Nielsen J, Andreas Sand A (2011) Algorithms for a parallel implementation of Hidden Markov
Models with a small state space. IPDPS Workshops 2011:452–459
39. Palaniappan K, Mukherjee S (2011) Predicting essential genes across microbial genomes: a
machine learning approach. In: Proceedings of the IEEE international conference on machine
learning and applications, pp 189–194
210 9 Genome Analysis
40. Palmer JD, Herbon LA (1988) Plant mitochondrial DNA evolves rapidly in structure, but slowly
in sequence. J Mol Evol 28(1–2):87–97
41. Perez-Rodriguez J, Garcia-Pedrajas N (2011) An evolutionary algorithm for gene structure
prediction. Industrial engineering and other applications of applied intelligent systems II
6704:386–395
42. Rebello S, Maheshwari U, Safreena Dsouza RV (2011) Back propagation neural network
method for predicting lac gene structure in streptococcus pyogenes M group A streptococcus
strains. Int J Biotechnol Mol Biol Res 2:61–72
43. Schulze-Kremer S (2000) Genetic algorithms and protein folding. Protein Struct Prediction
Methods Mol Biol 143:175–222
44. Siepel AC (2002) An algorithm to find all sorting reversals. Proceedings of the 6th annual
international conference computational molecular biology (RECOMB 2002). ACM Press, New
York, pp 281–290
45. Sung W-K (2009) Algorithms in bioinformatics: a practical introduction. CRC Press, Taylor
and Francis Group, pp 230–231
46. Trinca D, Rajasekaran S (2007) Self-optimizing parallel algorithms for haplotype reconstruc-
tion and their evaluation on the JPT and CHB genotype data. In: Proceedings of 7th IEEE
international conference on bioinformatics and bioengineering
Part III
Biological Networks
Analysis of Biological Networks
10
10.1 Introduction
complex networks which are networks with very large number of nodes and edges.
We then define the fundamental problems in biological networks which are mod-
ule detection, network motif search, and network alignment which constitute the
main topics in this part of the book along with phylogeny.
A cell contains many different types of chemical compounds with DNA located
at its nucleus. We have already seen in Chap. 2 the central paradigm of molecular
biology in which genetic code in DNA is read by the RNA polymerase complex and is
transcribed into the corresponding RNA. In the following translation phase, amino
acid sequences which make up the proteins are produced by the ribosomes that bind
to messenger RNA. The proteins formed interact with each other forming protein–
protein interaction (PPI) networks. Important biological networks at cell level are
related to DNA, RNA, genes, proteins, and metabolites. We will briefly describe the
main networks in the cell which are metabolic networks, gene regulatory networks,
and PPI networks in the next sections.
The metabolism of an organism is its basic chemical system that produces essential
cell ingredients, such as sugars, lipids, and amino acids. Metabolic networks model
all possible biochemical reactions in the cell to generate metabolisms. The nodes in
these networks are the biochemical metabolites and the edges represent either the
reactions which convert one metabolite to another or the enzymes that catalyze these
reactions [9,25,28]. The graphs representing the metabolic networks can be directed
or undirected depending on the reversibility of the reaction representing an edge.
A metabolic pathway is a sequence of biochemical reactions to perform a specific
metabolic function. For example, glycolysis is one such metabolic process where
energy supply is generated in which glucose molecules are broken into two sugars
and these sugars are used to generate adenosine triphosphates (ATPs). Metabolic net-
works have small diameters and hence exhibit small-world characteristics in which
the average path length between any two nodes is small. They also have few high-
degree and many low-degree nodes as found in scale-free networks. Figure 10.1
displays the simplified metabolic reaction network of E. coli which shows these
properties.
The study of metabolic networks and pathways within these networks is an impor-
tant area of research in biomedicine as understanding the metabolic mechanisms in
the cell helps to find cures for diseases. The infections can be controlled for exam-
ple, by discovering the differences between the metabolic networks of humans and
pathogens causing the infections [12].
10.2 Networks in the Cell 215
Fig. 10.1 The simplified metabolic reaction network of E. coli where endogenous metabolites are
the nodes and the edges represent the reactions (taken from [16])
Genes code for proteins via transcription and translation, which are essential for the
functioning of an organism. This process is known as gene expression. The regulation
of gene expression can be done at transcriptional, translational or post-translational
stages of gene expression. Gene expression is controlled by proteins produced by
other genes resulting in regulatory interactions. In simple terms, gene A regulates
gene B if a change in expression of gene A induces a change in the expression
of gene B. This regulation can take two forms; it can be an up-regulation which
activates the gene expression process in the other gene, or down-regulation in which
case the expression is inhibited. A gene regulation network (GRN) consists of genes,
proteins, and other small molecules which all make up the nodes of the network, and
their interactions form the edges. The regulators are usually proteins and referred to
as transcription factors. A GRN is commonly represented by a directed graph where
arrows show the direction of regulation. Figure 10.2 shows a GRN with three genes.
The GRNs are sparse, that is, they have low density which means genes are reg-
ulated by few other genes. The out-degrees of nodes in a GRN follows power-law
and scale-free properties meaning there are only few high out-degree nodes in these
networks and the rest of the nodes have low out-degrees. The maximum distance
between the nodes, the diameter of the gene networks, is low which is a property
of the small-world networks. The GRNs of the lysis/lysogeny cycle regulation of
216 10 Analysis of Biological Networks
Fig. 10.2 A gene regulatory network with three genes A, B, C; three mRNAs 1, 2, 3; and three
proteins X, Y, and Z. Gene A regulates gene B by protein X at transcription, gene B regulates gene
C at translation by protein Y, and gene C regulates gene A at post translation by protein Z to modify
protein X
Proteins consist of sequences of amino acids and they carry out vital functions such
as acting as enzymes for catalysis of metabolic processes, signaling compounds, or
serving as transporters for various substances like oxygen in the cell. A PPI network is
built from interacting proteins. The interaction is needed in the cell for the activation
of a protein by another one or to build protein complexes which are a group of
proteins that form clusters. An onset of a disease is usually related to alterations
in certain protein–protein interactions. This observation necessitates the study of
protein interactions from a system level to understand the disease process better.
Detecting protein–protein interactions have been traditionally studied using phys-
ical methods such as affinity chromatography or co-immunoprecipitation [12,19,29].
Two techniques which are widely used currently are the yeast two-hybrid (Y2H)
system analysis of protein complexes and affinity purification coupled to mass spec-
trometry (AP-MS) method. The Y2H system uses the structures of transcriptional
10.2 Networks in the Cell 217
factors and AP-MS method is based on the purification of protein complexes used
with mass spectrometry [12]. A PPI network can be modeled conveniently by an
undirected graph G(V, E) where V is the set of proteins and E is the set of edges
showing the interactions. The graph is undirected as it is usually not possible to
determine which protein binds the other, in other words, which protein influences
the other.
As can be seen in Fig. 1.2 of Chap. 1, this sample PPI network and PPI networks in
general are not homogenous. There are central nodes with many number of connec-
tions and the majority of nodes have few connections. The central nodes provide the
connections between the nodes with few connections. Although the number of nodes
in a PPI network as in any molecular biological network is very large, any node can
reach any other node with only few number of hops. We can therefore state that these
networks have small diameters. Indeed, this physical property is very useful in fast
transmission of signals between proteins. On one hand, we need these central nodes
function properly for the healthy state of an organism. On the other hand, if the PPI
network is formed or modified by the disease state of an organism, the therapy, and
the drug design method should target these central nodes as their failure will help
stop the functioning of the disease process. The PPI networks have this small-world
property and the scale-free property as commonly found in other molecular biologi-
cal networks [6,24]. We can also detect groups of nodes with high density which are
called clusters. These clusters may have some important functionality attributed to
them as there is a strong interaction in these groups. Another area of interest is the
search of repeating patterns of subgraphs in these networks which are called motifs.
These motifs may have certain attributed functionality and they may be the building
blocks of these networks. Alignment of two or more networks is the network analogy
of the sequence alignment we have reviewed in Chap. 6, our aim in this case is to
compare and find how similar two or more networks are. Aligning PPI networks of
different organisms helps us to discover the evolutionary relationships between them
as well as detecting their functional similarities. We can therefore specify main topo-
logical problems in PPI and molecular biological networks as detecting the central
nodes and the clusters, searching motifs, and aligning these networks.
In this section, we review the main biological networks outside the cell which are
the networks of the brain, phylogenetic networks, and the food web.
the brain networks at a coarser level. The structures and implications of these two
networks will be described next.
achieved by using the neural network models alone and models at much coarser level
are needed as described next.
(a) (b)
Fig. 10.4 a A rooted phylogenetic tree of organisms A,…, F. b A rooted phylogenetic network of
organisms P,…, T
All living organisms need to consume food for survival. Plants make their own food
using a process called photosynthesis. Many organisms however, are not able to
produce their own food and they rely on other organisms such as plants or animals
for survival. A food chain displays the relationships between the producers and the
consumers. An arrow in a food chain diagram indicates the predator-prey relationship
with the arrow pointing to the predator. The food chain in sea is shown in Fig. 10.5
in simplified form. At the bottom, the plankton which is a microscopic plant is
consumed by small fish and the small fish is a prey to the big fish. When big fish
dies, it is consumed by bacteria which provide nutrition to the environment to be
consumed by planktons.
In general, many consumers intake more than one type of food such as humans
who eat plants and animals and therefore many food chains are interconnected. A
food web is a complex network of many interconnected food chains. Cycles are
common in a food web as the dead organisms are consumed by other organisms
which provide nutrition to some other organisms. A food web is considered as a
static networks as this relation is hardly altered over time. One fundamental research
topic in food webs as biological networks is on investigation of the effect of removal
of an organism from a food web, which does happen due to major climate changes.
We can investigate the properties of biological networks from the view of graph
theory as global or local properties. Global properties reflect characteristics of a
biological network when considered as a whole. Local properties on the other hand,
describe the node properties specific to the surroundings of a node. It is however
possible to deduce global network properties from local network properties in many
cases.
10.4.1 Distance
The distance d(u, v) between two nodes u and v of a graph is defined as the shortest
path between them. For undirected graphs, this distance is specified as the minimum
number of edges (hops) and for weighted graphs, it is the sum of the weights of
the minimum-weight path between the two vertices. There may be more than two
such paths between the vertices u and v which is more common for unweighted
and undirected graphs. The distance between the two vertices in two directions may
not be equal in a directed graph and hence the relation is not symmetric in general.
There will be n(n − 1) number of such distances, counting in both directions. The
distance from a node to all other nodes can be computed using Dijkstra’s shortest
path algorithm we saw in Sect. 3.6 in O(m + n log n) time. Computation for all
nodes can be done by running this algorithm for all nodes. The average distance of
a graph G, dG (av), is the average of all distances between each pair of nodes in G.
This parameter provides us with the information on how easy it is to reach from one
vertex to another.
222 10 Analysis of Biological Networks
The degree of a vertex v, δ(v) is defined as the number of edges incident to v. For
directed graphs, the in-degree of a vertex v is the number of edges that end at v
and similarly, the out-degree of v is the number of edges that start at v. The average
degree, δ(av), of a graph is the average of all degrees.The sum of degrees of vertices
of an undirected graph is equal to twice the number of edges (2m) as we count them
in both directions. The density of a graph, ρ(G), is defined as the number of its edges
to the maximum possible number of edges as 2m/(n(n − 1)). We can therefore state
that ρ(G) = δ(av)/(n − 1). In dense graphs, ρ does not change significantly as n
gets large, otherwise the graph is called sparse.
The degree sequence of a graph specifies the list of its vertices in decreasing
or increasing order, usually in decreasing order. Isomorphic graphs have the same
degree sequence but graphs with the same degree sequence may not be isomorphic.
We will need to generate similar graphs randomly to test our algorithms so that
we can evaluate the significance of the results obtained in biological networks. The
degree sequence is one parameter we try to conserve while generating these random
test graphs as we will see in Chap. 13.
The degree distribution is another important global property of a graph. This
parameter specifies the percentage of the vertices with the same degree to the total
number of vertices as follows: mk
P(k) = , (10.1)
n
where the number of vertices with degree k is shown by m k . The degree dis-
tribution is Binomial in random networks. Figure 10.6 displays these concepts. In
assortative networks, the high-degree nodes tend to connect to high-degree nodes
like themselves as in social networks. Conversely, in disassortative networks, the
high-degree nodes have the tendency to form connections with low-degree nodes. In
molecular interaction networks, the disassortative property is prevalent, for example,
high-degree nodes called hubs in PPI networks are usually linked to nodes with few
connections.
(a) (b)
Fig. 10.6 a A sample graph with node degrees is shown. The average degree for this graph is 2.63,
the degree sequence is 5, 4, 3, 3, 2, 2, 1, 1 and its density is (2 × 11)/(8 × 7) = 0.39. b The degree
distribution of this graph
10.4 Properties of Biological Networks 223
The clustering coefficient of a graph provides more information about the connec-
tivity of a graph than its average degree. The clustering coefficient of a vertex v,
CC (v), is the ratio of the connections between its neighbors to the maximum possi-
ble connections between these neighbors, as follows.
nv
CC(v) = , (10.2)
kv (kv − 1)
where kv is the number of neighbors vertex v and n v is the number of existing edges
between them. The clustering coefficient shows the density of connections around a
vertex and it reflects the connectedness of the neighborhood of a vertex. Clustering
coefficient is 0 for a vertex with no connected neighbors such as the central vertex
of a star network and it is 1 for any vertex of a complete network.
The average clustering coefficient, or the clustering coefficient of a graph as more
commonly used, is the average value of the node clustering coefficients and shows
the tendency of the network to form clusters. It is the probability that there is an edge
between the two neighbors of a randomly selected vertex. This parameter is defined
as follows:
1
CC(G) = CC(v) (10.3)
n
v∈V
Figure 10.7 shows the clustering coefficients of the vertices of a sample graph. The
average clustering coefficient of the nodes with degree k is the clustering function
C(k). In [21], the clustering coefficients for the metabolic networks of 43 organisms
were calculated and all were found to be about an order of magnitude higher than
expected for a scale-free network of similar size.
The matching index shows the similarity of two nodes in a network. The matching
index between two vertices u and v of a graph G is defined as the ratio of the
total number of common neighbors of these two vertices to total number of their
distinct neighbors. The vertices b and f of Fig. 10.7 have a and e as their common
neighbors and they have five vertices a, c, e, g, h as distinct neighbors, therefore,
their matching index Mb f is 2/5 = 0.4. We can extend this concept to neighborhoods
of vertices to more than one hop and define k-hop matching index as the ratio of
common neighbors of two vertices that are within k hops to all distinct neighbors
that are within k hops. We can then consider two vertices as similar if their matching
indices to all other vertices in the network are approximately equal. The matching
index can be used to relate different parts of a network based on some property. It
has been used to analyze spatial growth in BFNs and to predict the connections of
private cortical networks [17].
10.5 Centrality
The degree centrality of a node is simply its degree in the network. The degree
of a vertex provides us with information about its importance in the network. For
example, a high-degree protein in a PPI network is involved in many interactions
with its neighbors and removal of such a node from this network may have lethal
effects than removing a protein with a lower degree [12,13]. The degree centrality of
the nodes of a graph G(V, E) can be computed using its adjacency matrix A as the
sum of the rows of this matrix yields degree centrality values of the nodes. In matrix
notation, the centrality vector C = A × [1], that is, it can be formed by multiplying
the adjacency matrix with a vector of all 1’s.
The degree centrality shows the local importance of a node. We can calculate the
average degree of a network which will give us some idea about the structure of the
nodes, however, the graph structure as a whole will be difficult to predict from this
parameter. For example, a PPI network has few protein nodes with very high degrees
and the rest of the nodes have only few connections to their neighbors. Evaluating the
average degree centrality for such a network will not reveal this scale-free structure.
In general, this metric alone is not sufficient to evaluate the functional importance
of a node.
10.5 Centrality 225
where σst is total number of shortest paths between vertices s and t, and σst (v) is
the total number of shortest paths between vertices s and t that pass through vertex v.
For an unweighted graph, we could run a modified version of the BFS algorithm in
which we keep records of multiple shortest paths between the source and destination
node pairs (s, t). Afterwards, we can find the ratio of the shortest paths that pass
through a vertex to find its betweenness centrality. The multiple shortest paths to
each node is found in a sample graph in Fig. 10.9. We then find the ratio of shortest
paths through each vertex and the sum of these values yield the betweenness value
226 10 Analysis of Biological Networks
(a) (b)
(c) (d)
(e) (f)
Fig. 10.8 Closeness centrality example. The sum of the shortest path values are shown next to the
source nodes. The closeness centrality values for vertices a, . . . , f are 1/7,1/5,1/8,1/8,1/7, and 1/9
respectively. The vertex b has the largest closeness centrality which can also be seen visually as it
is closer to all vertices than all others
for vertices. For example, vertex c in (d) has the shortest paths (a, d), ( f, d), (b, d),
and (e, d) running through it. There is one shortest path between a and d so the
contribution for this path is 1. Similarly, (b, d) and (e, d) paths are unique raising
the betweenness of c to 3. However, the ( f, d) path has two alternatives both of
which pass through c and each contribute 0.5 resulting in a betweenness value of 4
for node c. We need to sum all values for shortest paths to all source nodes which
gives 8 for node c.
(a) (b)
(c) (d)
(e) (f)
Fig. 10.9 Vertex betweenness centrality example. The betweenness values computed for vertices
a, . . . , f are 0, 6, 8, 0, 2, 1.5 with vertex c having the highest value and vertex b with a close value
to c which can also be seen visually as these vertices have several shortest paths running through
them
means v has an extra shortest path to s other than through u. In this case the weight
of v is made equal to the sum of the weights of u and v as it has multiple shortest
paths running through it. The total weight of a vertex is the sum of weights found
for all source vertices.
Having found vertex weights, the next step is to determine the edge weights
representing edge betweenness values. In the algorithm proposed by the authors,
the leaf vertices from the vertex labeling algorithm are first identified, and for each
vertex v that is a neighbor to such a leaf vertex u , the edge (u, v) is assigned a weight
wu /wv . Then, moving upwards towards the source vertex s, each edge (u, v) with u
being farther to s than v, is assigned a weight that is 1 plus the sum of the weights
of edges below it multiplied by wv /wu . This procedure continues until s is reached.
Algorithm 10.2 displays the pseudocode for this algorithm.
Fig. 10.10 Edge betweenness centrality example. For the source vertex a, the vertex weights are
computed using Algorithm 10.1 and then the edge weights are computed using Algorithm 10.2.
This process has to be repeated for all vertices and the edge betweenness value for an edge is the
sum of all the values obtained
This process is repeated for all source vertices and the edge betweenness value
of an edge is the sum of all edge betweenness values found for each source vertex
which can be performed in O(mn) time. In order to find clusters of a network, the
final step involves removing the edge with the highest edge betweenness value from
the graph at each step, until the clusters that meet a required criteria are discovered
giving a total time complexity of O(m 2 n) or O(n 3 ) on sparse graphs. Figure 10.10
displays the vertex weights and edge betweenness values obtained in a sample graph
for the source vertex a using this algorithm.
where N (i) is the set of neighbors of node i and ai j is the ijth entry of the adja-
cency matrix A of the graph G(V, E) which shows neighborhood of the nodes. This
equation can be rewritten in matrix notation as:
Ax = λx (10.8)
230 10 Analysis of Biological Networks
The random graph model was proposed by Erdos and Renyi in 1950s. This model
assumes that there are n vertices, {v1 , . . . , vn }, and m edges are to be formed in the
network. An edge (vi , v j ) is placed between vertices vi and v j with probability p =
2m/(n(n − 1) and this process is repeated for each pair of vertices. It has been shown
that the degree distribution in these networks is Binomial which can be approximated
by a Poission distribution for large networks as shown in Fig. 10.11. Also, the average
clustering coefficient, or simply the clustering coefficient, is inversely proportional
to the size of the network in random networks and the average path length is small,
proportional to the logarithm of the network size. However, many real-world complex
networks do not exhibit the homogenous degree distribution and small clustering
coefficient observed in these networks.
10.6 Network Models 231
(a) (b)
Networks with small average path length usually with co-existing large cluster-
ing coefficients are called small-world networks. The average shortest path length
between the nodes of such networks is small compared with the size of the network
and usually is proportional to log n. PPI networks, metabolic networks, and gene
regulation networks have small average path lengths and large average clustering
coefficients. Watts and Strogatz proposed a model (WS model) which accounts for
small average path length and high clustering coefficient as observed in many real-
life complex networks including biological networks [31]. A simple way to generate
a small world network based on this model is to start from a regular lattice type of
network where each node is placed on a one dimensional ring connected to its n/2
neighbors, providing high clustering coefficient. Then, rewiring of a node to one of
the distant vertices with a probability pn is provided. The network becomes closer
to an ER random network with low clustering coefficient and short average path
distance proportional to the logarithm of the network size when pn increases, and it
becomes an ER random network when pn → 1 as shown in Fig. 10.12. However, the
WS model exhibits only these two properties of biological networks, namely, high
Fig. 10.12 WA model. a A regular network which has high average clustering coefficient but also
large average path length. b A rewired scale-free network with high average clustering coefficient
and small average path length. c A network approaching ER random network as rewiring is increased
232 10 Analysis of Biological Networks
clustering coefficient and short average path length but fails to exhibit another very
important property of these networks as described next.
Many biological networks have the power-law degree distribution where the degree
distribution has the form:
P(k) ∼ k −γ , γ > 1 (10.9)
where γ is termed the power-law exponent. The plot of degree distribution displays
a heavy-tailed curve in these networks which means there are few nodes with high
degrees and most of the nodes have low degrees. The networks with power-law
degree distribution are called scale-free networks. The majority of the nodes in these
networks have only few neighbors and a small fraction of them have hundreds and
sometimes thousands of neighbors. Many biological networks are scale-free, for
example, degree distributions of the central metabolic networks of 43 organisms
were shown to have heavy tails in agreement with Eq. 10.9 with 2 < γ < 3 [9]. The
PPI networks of E. coli, D. melanogaster, C. elegans and H. pylori were also shown
to be scale-free [6].
The high-degree nodes in scale-free networks are called hubs. In a PPI network,
the removal of a hub protein will presumably have an important effect for the survival
of the network than the removal of a protein with low degree. In simplest terms, the
PPI network will be disconnected and the transfer of signals between many of the
nodes will cease. A hub in a PPI network may be formed as a result of a disease
and detecting of such disease-related hubs is important in a PPI network as drug
therapies can be targeted to them to stop their functioning.
Barabasi and Albert was first to propose a method to form scale-free networks
[2]. The so-called Barabasi-Albert model (BA model) assumes that these networks
are dynamic in nature and given an initial network G 0 , the dynamics of the network
is governed by two rules:
This method provided forming of a scale-free network with power law P(k) ∼
k −γ with γ = 3 [2]. The BA model mainly shows the distribution of dynamically
evolving complex networks including biological networks. However, some issues
have been raised about the validity of this model for biological networks [15]. The
techniques used to identify interactions in biological networks are error-prone and
only samples of these networks can be evaluated. These samples may not reflect the
10.6 Network Models 233
Fig. 10.15 Network motifs a Feed-forward loop found in PPI networks and neural networks.
b Three-node feedback found in GRNs. c Three chain; found in food webs. d Four-node feedback
found in GRNs
networks of two species and assess their similarity based on their topological struc-
ture resemblance. However, exact topological matches is not possible because of
the dynamic nature of these networks which results in various modifications such
as addition or removal of edges or nodes. Furthermore, the measurements to detect
interactions in them are error-prone resulting in detection of false positive or negative
interactions. For this reason, we need to assess their similarity level rather than test-
ing whether they match exactly. This problem is related to subgraph isomorphism
problem which is NP-hard and hence, heuristic algorithms are usually preferred.
Global network alignment is the comparison of two or more networks as a whole
to determine their overall similarity as shown in Fig. 10.16. Local alignment algo-
rithms on the other hand, aim to discover conserved structures across species by
comparing their subnetworks. Two species may have a very similar subnetwork dis-
covered by a local alignment algorithm whereas they may not be similar as a whole
when searched by a global alignment algorithm. Network alignment is basically the
process of finding the level of similarity between two or more networks.
In this chapter, we first described the main networks in the cell which are the metabolic
networks, gene regulation networks, the PPI networks and then various other net-
works outside the cell. We then defined the parameters needed to analyze the global
properties of biological networks. The average degree is one such parameter and
although this parameter is meaningful in a random network, it does not provide us
with much information in other types of networks. Degree distribution on the other
hand, gives us a general idea about how the nodes are connected. The average path
length or distance shows the easiness of reaching from one node to another. The
clustering coefficient of the network is another important parameter that shows the
probability of the two neighbors of a randomly selected vertex to be connected.
Centrality measures evaluate the importance of a node or an edge in a network and
betweenness centrality values provide more realistic results than others in general.
We then reviewed basic network models which are the ER random network, small-
world, and scale-free network models. Small-world networks are characterized by
small average distance between their nodes and scale-free property of a network
means the number of nodes decreases sharply as their degrees increase. We find most
of the biological networks exhibit small-world and scale-free properties, however, the
clustering coefficients of the nodes in these networks decreases when their degrees
increase as experiments show. This fact allowed researchers to look for a new model
to explain the aforementioned properties of these networks.
The hierarchical model proposed captures all of the observed properties of biolog-
ical networks. This model assumes that the low-degree nodes in biological networks
constitute dense clusters with high clustering coefficients. A high-degree node or
a hub on the other hand has low clustering coefficient and its main function is to
act as gateway between the small dense clusters. It is possible to explain all of the
10.10 Chapter Notes 237
properties observed with this model, for example, a hub in a PPI network serves
as a main switch to transfer signals between the small and dense protein clusters.
It also provides short paths between these clusters, contributing to the small-world
property.
The experiments are performed on samples rather than the whole network for
practical reasons. Two major problems associated with evaluating biological network
data is that the measurements are frequently error-prone with false positive and
false positive interaction detections. Secondly, the sample may not reflect the whole
network. As technology advances, we will have less errors and larger sample sizes
and this problem will have less effect on the accuracy of the measurements.
In the final section, we reviewed the module detection, network motif search
and network alignment problems in biological networks. Module detection involves
discovering the cluster structures which usually represent some important activity in
that part of the network. Network motifs are repeating patterns of small subgraphs in
a biological network which may display basic functions of the network. We can also
compare the motifs found in networks of species to investigate their evolutionary
relationships. In the network alignment problem, we search for similar subgraphs in
two or more species networks to compare them similar to what we did in comparing
DNA or protein sequences. We will investigate these problems and sequential and
distributed algorithms to solve them in detail in the rest of this part of the book.
Exercises
1. Work out the degree sequence and sketch the degree distribution of the sample
graph of Fig. 10.17.
2. Find the clustering coefficients for all of the nodes in the sample graph of
Fig. 10.18. Plot the clustering coefficient distribution for this graph. What does it
show?
3. Compute the degree centrality values for the nodes of the graph in Fig. 10.19 by
first forming the adjacency matrix of this graph and then multiplying this matrix
by a vector of all 1s.
238 10 Analysis of Biological Networks
7. Work out the vertex betweenness centrality values for the vertices of the graph of
Fig. 10.20.
8. Find the edge betweenness centrality values for the vertices of the graph of
Fig. 10.20 for source vertex g.
9. Describe briefly the basic topological properties of biological networks and why
hierarchical network model is a better model than the small-world and scale-free
models to explain the behaviour of these networks.
References
1. Albert R, Othmer HG (2003) The topology of the regulatory interactions predicts the expression
pattern of the drosophila segment polarity genes. J Theor Biol 223:1–18
2. Albert R, Barabasi A (2002) The statistical mechanics of complex networks. Rev Mod Phys
74:47–97
3. Bassett DS, Bullmore E, Verchinski BA et al (2008) Hierarchical organization of human cortical
networks in health and schizophrenia. J Neurosci 28:9239–9248
4. Davidson EH, Rast JP, Oliveri P, Ransick A et al (2020) A genomic regulatory network for
development. Science, 295:1669–1678
5. Eguiluz VM, Chialvo DR, Cecchi GA, Baliki M, Apkarian AV (2005) Scale-free brain func-
tional networks. Phys Rev Lett 94:018102
6. Floyd RW (1962) Algorithm 97: shortest path. Comm ACM 5(6):345
7. Goh K, Kahng B, Kim D (2005) Graph theoretic analysis of protein interaction networks of
eukaryotes. Physica A 357:501–512
8. He Y, Chen Z, Evans A (2010) Graph theoretical modeling of brain connectivity. Curr Opin
Neurol 23(4):341–350
9. He Y, Chen Z, Evans A (2008) Structural insights into aberrant topological patterns of large-
scale cortical networks in Alzheimers disease. J Neurosci 28:4756–4766
10. Identifying gene regulatory networks from gene expression data
11. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL (2000) The large-scale organization of
metabolic networks. Nature 407:651–654
12. Junker B (2008) Analysis of biological networks, Chap. 9. Wiley ISBN: 978-0-470-04144-4
13. Koschtzki D, Lehmann KA, Tenfelde-Podehl D, Zlotowski O (2005) Advanced centrality con-
cepts. Springer-Verlag LNCS Tutorial 3418:83-111, In: Brandes U, Erlebach T (eds) Network
analysis: methodological foundations
14. Kaiser M, Martin R, Andras P, Young MP (2007) Simulation of robustness against lesions of
cortical networks. Eur J Neurosci 25:3185–3192
15. Mason O, Verwoerd M (2007) Graph theory and networks in biology. IET Syst Biol 1(2):89–
119
240 10 Analysis of Biological Networks
16. Pablo Carbonell P, Anne-Galle Planson A-G, Davide Fichera D, Jean-Loup Faulon J-P (2011)
A retrosynthetic biology approach to metabolic pathway design for therapeutic production.
BMC Syst Biol 5:122
17. Pavlopoulos GA, Secrier M, Moschopoulos CN, Soldatos TG, Kossida S, Aerts J, Schneider
R, Bagos Pantelis GPG, (2011) Using graph theory to analyze biological networks. Biodata
Mining 4:10. doi:10.1186/1756-0381-4-10
18. Perron O (1907) Zur Theorie der Matrices. Mathematische Annalen 64(2):248–263
19. Phizicky EM, Fields S (1995) Proteinprotein interactionsmethods for detection and analysis.
Microbiol Rev 59:94–123
20. Ptashne M (1992) A genetic switch: phage lambda and higher organisms, 2nd edn. Cell Press
and Blackwell Scientific
21. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabsi AL (2002) Hierarchical organization
of modularity in metabolic networks. Science 297:1551–1555
22. Rubinov M, Sporns O (2010) Complex network measures of brain connectivity: uses and
interpretations. NeuroImage 52(3):1059–1069
23. Salvador R, Suckling J, Coleman MR, Pickard JD, Menon D, Bullmore E (2005) Neurophys-
iological architecture of functional magnetic resonance images of human brain. Cereb Cortex
15:1332–1342
24. Salwinski L, Eisenberg D (2003) Computational methods of analysis of proteinprotein inter-
actions. Curr Opin Struct Biol 13:377–382
25. Schuster S, Fell DA, Dandekar T (2000) A general definition of metabolic pathways useful for
systematic organization and analysis of complex metabolic networks. Nat Biotechnol 18:326–
332
26. Seidenbecher T, Laxmi TTR, Stork O, Pape HC (2003) Amygdalar and hippocampal theta
rhythm synchronization during memory retrieval. Science 301:846–850
27. Stumpf MPH, Wiuf C, May RM (2005) Subnets of scale-free networks are not scale-free:
Sampling properties of networks. Proc National Acad Sci 102(12):4221–4224
28. Vidal M, Cusick ME, Barabasi AL (2011) Interactome networks and human disease. Cell
144(6):986–998
29. Vitale A (2002) Physical methos. Plant Mol Biol 50:825–836
30. Vogelstein B, Lane D, Levine A (2000) Surfing the p53 network. Nature 408:307–310
31. Watts DJ, Strogatz SH (1998) Collective dynamics of small-world networks. Nature 393:440–
442
Cluster Discovery in Biological
Networks 11
11.1 Introduction
Clustering is the process of grouping similar objects based on some similarity mea-
sure. The aim of any clustering method is that the objects belonging to a cluster
should be more similar to each other than to the rest of the objects under consid-
eration. Clustering is one of the most studied topics in computer science as it has
numerous applications in bioinformatics, data mining, image processing, and com-
plex networks such as social networks, biological networks, and the Web.
We will make a distinction between clustering data points commonly distributed
in 2D plane which we will call data clustering and clustering objects which are
represented as vertices of a graph in which case we will use the term graph clustering.
Furthermore, graph clustering can be investigated as inter-graph clustering where a
subset from a given set of graphs are clustered based on their similarity or intra-graph
clustering in which our object is to find clusters in a given graph. We will assume
the latter when we investigate clustering in biological networks in this chapter.
Intra-graph clustering or graph clustering in short, considers the neighborhood
relationship of the vertices while searching for clusters. In unweighted graphs, we
try to cluster nodes that have strong neighborhood connections to each other and this
problem can be viewed as finding cliques of a graph in the extreme case. Our aim in
edge-weighted graphs, however, is to place neighbors that are close to each other in
the same cluster using some metric.
Biological networks are naturally represented as graphs as we have seen in
Chap. 10, and any graph clustering algorithm can be used to detect clusters in bio-
logical networks such as the gene regulation networks, metabolic networks, and PPI
networks. There are, however, important differences between a graph representing a
general random network and the graph of a biological network. First of all, the size
of a biological network is huge, reaching tens of thousands of vertices and hundreds
of thousands of edges, necessitating the use of highly efficient clustering algorithms
as well as usage of distributed algorithms for this computation-intensive task. Sec-
ondly, biological networks are scale-free with few very high-degree nodes and many
© Springer International Publishing Switzerland 2015 241
K. Erciyes, Distributed and Sequential Algorithms for Bioinformatics,
Computational Biology 23, DOI 10.1007/978-3-319-24966-7_11
242 11 Cluster Discovery in Biological Networks
low-degree nodes. Third, they exhibit small-world property having small diameters
relative to their sizes. These last two observations may be exploited to design efficient
clustering algorithms with low time complexities but this alone does not provide the
needed performance in many cases and using distributed algorithms is becoming
increasingly more attractive to solve this problem.
Our aim in this chapter is to first provide a formal background and a classification
of clustering algorithms in biological networks. We then describe and review efficient
sample algorithms, most of which are experimented in biological networks and have
distributed versions. In cases where there are no distributed algorithms known to
date, we propose distributed algorithm templates and point potential areas of further
investigation which may lead to efficient algorithms.
11.2 Analysis
We can have overlapping clusters where a node may belong to two or more clusters
or a node of the graph becomes a member of exactly one cluster at the end of a
clustering algorithm, which is called graph partitioning. Also, we may specify the
number of clusters k beforehand and the algorithm stops when there are exactly
k clusters, or it terminates when a certain criteria is met. Another distinction is
whether a node belongs fully to a cluster or with some probability. In fuzzy clustering,
membership of a node to a cluster is specified using a value between 0 and 1 showing
this probability [47].
Formally, a clustering algorithm divides a graph G(V, E) into a number of possibly
overlapping clusters C = C1 , . . . , Ck where a vertex v ∈ Ci is closer to all other
vertices in Ci than to vertices in other clusters. This similarity can be expressed in
a number of ways and a common parameter for graph clustering is based on the
average density of the graph and the densities of the clusters. We will now evaluate
the quality of a graph clustering method based on these parameters.
A basic requirement from any graph clustering algorithm is that the vertices in a
cluster output from the algorithm should be connected which means there will be
at least one path between every vertex pair (u, v) in a cluster Ci . Furthermore, the
path between u and v should be internal to the cluster Ci meaning u is close to v
[40], assuming the diameter of the cluster is much smaller than the diameter of the
graph. However, a more fundamental criteria is based on evaluating the densities of
the graph in a cluster and outside the cluster and comparing them. We will describe
two methods to evaluate these densities next.
11.2 Analysis 243
1
k
δint (G) = δint (Ci ) (11.3)
k
i=1
where k is the number of clusters obtained. For example, the intra-cluster densities
for clusters C1 , C2 , and C3 in Fig. 11.1a are 0.66, 0.33, and 0.5 respectively and the
average intra-cluster density is 0.50. We divide the same graph into different clusters
in (b) with intra-cluster densities of 0.7, 0.5, and 0.6 for these clusters and the average
density becomes 0.6. We can say that the clusters obtained in (b) are better as we
have a higher average intra-cluster density, as can be observed visually. The cut size
of a cluster is the size of the edges between Ci to all other clusters it is connected.
The inter-cluster density δext (G) is defined as the ratio of the size of inter-cluster
edges to the maximum possible size of edges between all clusters as shown below
[40]. In other words, we subtract the size of maximum total possible intra-cluster
edges from the size of the maximum possible edges between all nodes in the graph
to find the size of the maximum possible inter-cluster edges, and the inter-cluster
density should be as low as possible when compared with this parameter.
2 × sum of inter-cluster edges
δext (G) = (11.4)
n(n − 1) − ki=1 (|Ci ||Ci − 1|)
244 11 Cluster Discovery in Biological Networks
(a) (b)
0.33 0.5
0.66 0.7 C2
C2
C1
C1
C3
C3
0.5 0.6
The inter-cluster densities in Fig. 11.1a, b are 0.08 and 0.03 respectively, which
again shows the clustering in (b) is better since we require this parameter to be as
small as possible. The graph density in this example is (2 × 22)/(15 × 14) = 0.21
and based on the foregoing, we can conclude that a good clustering should provide a
significantly higher intra-cluster density than the graph density, and the inter-cluster
density should be significantly lower than the graph density.
When we are dealing with weighted graphs, we need to consider the total weights
of edges in the cut set, as the internal and external edges, rather than the number of
such edges. The density of an edge-weighted graph can be defined as the ratio of
total edge weight to the maximum possible number of edges as follows:
2 (u,v)∈E w(u,v)
ρ(G(V, E, w)) = (11.5)
n(n − 1)
The intra-cluster density of a cluster Ci in such an edge-weighted graph can then
be computed similarly to the unweighted graph but we sum the weights of edges
inside the clusters and divide it by the maximum possible number of edges in Ci
this time. The graph intra-cluster density is the average of intra-cluster densities
of clusters as before and the general requirement is that this parameter should be
significantly higher than the edge-weighted graph density. For inter-cluster density
of an edge-weighted graph, we can compute the sum of weights of all edges between
each pair of clusters and divide it by the maximum possible number of edges between
clusters as in Eq. 11.4 by just replacing the number of edges with their total weight.
We can then compare this value with the graph density as before and judge the quality
of clustering.
11.2 Analysis 245
11.2.1.2 Modularity
The modularity parameter proposed by Newman [35] is a more direct evaluation of
the goodness of clustering than the above described procedures. Given an undirected
and unweighted graph G(V, E) which has a cluster set C = {C1 , .., Ck }, modularity
Q is defined as follows [36]:
k
Q= (eii − ai2 ) (11.6)
i=1
where eii is the percentages of edges in Ci , and ai is the percentage of edges with
at least one edge in Ci . We actually sum the differences of probabilities of an edge
being in Ci and a random edge would exist in Ci . The maximum value of Q is 1 and
a high value approaching 1 shows good clustering. For calculating Q conveniently,
we can form a modularity matrix M which has an entry mij showing the percentage
of edges between clusters i and j. The diagonal elements in this matrix represent the
eii parameter in Eq. 11.6 and the sum of each row except the diagonal is equal to
aij of the same equation. We will give a concrete example to clarify these concepts.
Evaluating the modularity matrices M1 and M2 for Fig. 11.1a, b respectively yields:
⎡ ⎤ ⎡ ⎤
0.18 0.09 0.14 0.32 0.05 0.09
M1 = ⎣ 0.09 0.32 0.14 ⎦ M2 = ⎣ 0.05 0.23 0.05 ⎦
0.14 0.14 0.14 0.09 0.05 0.27
For the first clustering, we can calculate the contributions to Q using M1 from
clusters C1 , C2 and C3 as 0.127, 0.267, and 0.060 giving a total Q value of 0.247.
We can see straight away clustering structure in C3 is worse than others as it has the
lowest score. For M2 matrix of clusters in (b), the contributions are 0.30, 0.22, and
0.25 providing a Q value of 0.77 which is significantly higher than the value obtained
using M1 and also closer to unity. Hence, we can conclude that the clustering in (b)
is much more favorable than the clustering in (a). We will see in Sect. 11.4 that there
is a clustering algorithm based on the modularity concept described.
There are many different ways to classify the clustering algorithms based on the
method used. In our approach, we will focus on the methods used for clustering
in biological networks and provide a taxonomy of clustering algorithms used for
this purpose only as illustrated in Fig. 11.2. We have mostly included fundamental
algorithms in each category that have distributed versions or can be distributed.
We classify the clustering algorithms in four basic categories as hierarchical,
density-based, flow-based, and spectral algorithms. The hierarchical algorithms con-
struct nested clusters at each step and they either start from each vertex being a single
cluster and combine them into larger clusters at each step, or they may start from one
246 11 Cluster Discovery in Biological Networks
MST−based Cliques
MCL
k−cores
Edge−Betweenness
HCS
Modularity−based
cluster including all of the nodes and divide them into smaller clusters in each itera-
tion [28]. The MST-based and edge-betweenness-based algorithms are examples of
the latter hierarchical methods. Density-based algorithms search for the dense parts
of the graph as possible clusters. Flow-based algorithms on the other hand are built
on the idea that the flow between nodes in a cluster should be higher than the rest of
the graph and the spectral clustering considers the spectral properties of the graph
while clustering.
We search for clusters in biological networks to understand their behavior, rather
than partitioning them. However, we will frequently need to partition a graph repre-
senting such a network for load balancing in a distributed memory computing system.
Our aim is to send a partition of a graph to a process in such a system so that parallel
processing can be achieved. The BFS-based partitioning algorithm of Sect. 7.5 can
be used for this purpose. In the next sections, we will investigate sample algorithms
of these methods in sequential and distributed versions in detail.
We have described the basic hierarchical clustering methods in Sect. 7.3. We will now
investigate two graph-based hierarchical clustering approaches to discover dense
regions of biological networks.
11.3 Hierarchical Clustering 247
(a) (b)
C1
2 2
11 11
8 8
1 6 1
6
10 14 10
5 14 5
13 7 7
12 12
9 9
3 3 C2
4 4
(c)
C1
2
11
8
6 1
14 10
5
7 C2
9
3
C3 4
Fig. 11.3 MST-based clustering in a sample graph. MST is shown by bold lines and the edges are
labeled with their weights. The highest weight edge in the MST has weight 13 and removed in the
first step resulting in two clusters C1 and C2 . The next iteration removes the edge with weight 12
and three clusters C1 , C2 , and C3 are obtained
Instead of removing one edge at each iteration of the algorithm, we may start with
a threshold edge weight value τ and remove all edges that have higher weights than
τ in the first step which may result in a number of clusters. We can then check the
quality Q of the clusters we obtain and continue if Q is lower than expected. This
parameter can be the ratio of intra-cluster density to the inter-cluster density or it
can simply be computed as the ratio of the total weight of intra-cluster edges in the
current clusters to the total weight of inter-cluster edges. We may modify the value
of τ as we proceed to refine the output clusters as a large τ value may result in many
small clusters and a small value will generally give few large clusters. MST of a
graph can be constructed using one of the greedy approaches as follows:
• Prim’s Algorithm: This algorithm greedily includes an edge of minimum weight
in MST among edges that are incident on the current MST vertices but not part of
the current MST as we have seen in Sect. 3.6. Prim’s algorithm requires O(n2 ) as
it checks each vertex against all possible vertex connections but this time may be
reduced to O(mlogn) by using the binary heap data structure and to O(m + nlogn)
by Fibonacci heaps [13].
• Kruskal’s Algorithm: Edges are sorted with respect to their weights and starting
from the lightest weight edge, an edge is included in MST if it does not create
a cycle with the existing MST edges. The time for this algorithms is dominated
by the sorting of edges which is O(m log m) and if efficient algorithms such as
union-find are used, it requires O(m log n) time.
• Boruvka’s Algorithm: This algorithm is the first MST algorithm designed to con-
struct an efficient electricity network for Moravia, dating back to 1926 [5]. It finds
the lightest edges for each vertex and contracts these edges to obtain a simpler
graph of components and then the process is repeated with the components of the
new graph until an MST is obtained. It requires O(m log n) time to build the MST.
CLUMP
As we have seen in Sect. 11.5, the vertex betweenness centrality CB (v) of a vertex
v is the percentage of the shortest paths that pass through v. Similarly, the edge
betweenness centrality CB (e) of an edge e is the ratio of the shortest paths that pass
through e to total number of shortest paths. These two metrics are shown below:
σst (v) σst (e)
CB (v) = , CB (e) = (11.7)
σst σst
s=t=v s=t=v
1. Find edge betweenness values of all edges of the graph G(V, E) representing the
network.
2. Remove the edge with the highest edge betweenness value from the graph.
3. Recalculate edge betweennesses in the new graph.
4. Repeat steps 1 and 2 until a quality criteria is satisfied.
The general idea of this algorithm is that an edge e which has a higher edge
betweenness value than other edges has a higher probability of joining two or more
clusters as there are more shortest paths passing through it. In the extreme case,
this edge could be a bridge of G in which case removing it will disconnect G.
It is considered as a hierarchical divisive algorithm since it starts with a single
cluster containing all vertices and iteratively divides clusters into smaller ones. The
fundamental and most time consuming step in this algorithm is the computation of
the edge betweenness values which can be performed using the algorithms described
in Sect. 11.5.
The dense parts of an unweighted graph have more edges than average and exhibit
possible cluster structures in these regions. If we can find methods to discover these
dense regions, we may detect clusters. Cliques are perfect clusters and detecting a
clique does not require any comparison with the density in the rest of the graph.
In many cases, however, we will be interested in finding denser regions of a graph
with respect to other parts of it rather than absolute clique structures. We will first
describe algorithms to find cliques in a graph and then review k-cores, HCS, and
modularity-based algorithms with their distributed versions in this section.
Blaar et al. implemented a parallel version of Bron and Kerbosch algorithm using
thread pools in Java and provided test results using 8 processors [6]. Mohseni-Zadeh
et al. provided a clustering method they called Cluster-C to cluster protein sequences
based on the extraction of maximal cliques [32] and Jaber et al. implemented a parallel
version of this algorithm using MPI [27]. Schmidt et al. provided a scalable parallel
implementation of Bron and Kerbosch algorithm on a Cray XT supercomputer [41].
1−core
3−core
coreness values
1
2
3
2−core
The k-core decomposition of a graph G is to find the k-core subgraphs of G for all
k which can therefore be reduced to finding coreness values of all vertices of G. Core
decomposition has been used for complex network analysis [2] and to detect k-cores
in PPI networks [1]. Detecting group structures such as cliques, k-cliques, k-plexes,
and k-clubs are difficult and NP-hard in many cases, however, finding k-cores of a
graph can be performed in polynomial time as we describe in the next section.
Execution steps of this algorithm in the sample graph of Fig. 11.5 is shown in
Table 11.1.
256 11 Cluster Discovery in Biological Networks
coreness values
i 3−core j
1
b
e a
g 2
c
3
f d
k 2−core 1−core
h
smaller and a denser complex and a low value results in the contrary. The last step is
used to filter and modify the complexes. Complexes that do not have at least a 2-core
are removed during filtering process and the optional fluff operation increases the
size of the complexes according to the fluff parameter which is between 0.0 and 1.0.
The time complexity of this algorithm is O(nmh3 ) where h is the vertex size of the
average vertex neighborhood in G as shown in [3].
There is not a reported parallel or distributed version of this algorithm, however,
the search of dense neighbors can be performed by the BFS method and this step
can be employed in parallel using a suitable parallel BFS algorithm such as in [10].
The highly connected subgraphs (HCS) algorithm proposed by Hartuv and Shamir
[26] searches dense subgraphs with high connectivity rather than cliques in undi-
rected unweighted graphs. The general idea of this algorithm is to consider a subgraph
G of n vertices of a graph G as highly connected if G requires a minimum of n/2
edges to have it disconnected. In other words, the edge connectivity of G , kE (G )
should be n/2 to accept it as a highly connected subgraph. The algorithm shown in
11.4 Density-Based Clustering 259
Algorithm 11.5 starts by first checking if G is highly connected, otherwise uses the
minimum cut of G to partition G into H and H , and recursively runs HCS procedure
on H and H to discover highly connected subgraphs.
The execution of HCS algorithm is shown in a sample graph in Fig. 11.6 after
which three clusters are discovered. HCS has a time complexity of 2N × f (n, m)
where N is the number of clusters discovered and f (n, m) is the time complexity
of finding a minimum cut in a graph that has n vertices and m edges. HCS has
been successfully used to discover protein complexes, and cluster identification via
connecting kernel (CLICK) algorithm is an adaptation of HCS algorithm for weighted
graphs [43].
We have seen that the modularity parameter Q provides a good indication of the
quality of the clustering in Sect. 12.2. The algorithm proposed by Girvan and Newman
C2
C1
C3
Fig. 11.6 HCS algorithm run on a sample graph. Three clusters C1 , C2 and C3 are discovered
260 11 Cluster Discovery in Biological Networks
Parallel and distributed algorithms that find clusters using modularity are scarce.
Gehweiler et al. proposed a distributed diffusive heuristic algorithm for clustering
using modularity [25]. Riedy et al. proposed a massively parallel community detec-
tion algorithm for social networks based on Louvani method [39]. We will describe
this algorithm in detail as it is one of the only parallel algorithms for this purpose.
It consists of the following steps which are repeated until a termination condition is
encountered:
1. Every edge of the graph is labeled with a score. If all edges have negative scores,
exit.
2. Compute a weighted maximal matching using these scores.
3. Coarsen matched groups into a new group which are the nodes of the new graph.
In the first step, the change in optimization metric is evaluated if two adjacent
clusters are merged and a score is associated with each edge. The second step involves
selecting pairs of neighboring clusters merging of which will improve the quality of
clustering using a greedy approximately maximum weight maximal matching and the
selected clusters are contracted according to the matching in the final step. The time
11.4 Density-Based Clustering 261
p0
C1
C8 p1
C7
C5
C2
C4
C10
C3
C6
p2
C9
partitioned into a number of clusters, however, we could have started by the original
graph assuming each node is a cluster. The root performs the following steps:
1. Assume each cluster is a supernode and perform a BFS partition on the original
graph to have k cluster partitions such that C = {C1 , . . . , Ck }.
2. send each partition to a process pi .
3. find the best cluster pair in my partition.
4. receive best cluster pairs from each pi .
5. find the pair Cx , Cy that gives the maximum modularity.
6. broadcast Cx , Cy to all processes.
7. repeat steps 3–5 until modularity starts reducing.
Figure 11.7 displays an example network that already has 10 clusters. The root
process p0 partitions this network by the BFS partitioning algorithm and sends the
two cluster partitions to processes p1 and p2 . In this example, p1 computes the
modularity values for combining operations C2 ∪ C8 , C2 ∪ C10 , C2 ∪ C5 , C5 ∪ C8 ,
C5 ∪ C10 , and C5 ∪ C7 , assuming for any cluster pair across the borders, the process
that owns the lower identifier cluster is responsible to compute modularity. Further
optimizations are possible such as in the case of local cluster operation in a process
is decided to merge C2 and C10 in p1 , the processes p0 and p2 do not need to compute
their modularity values again as they are not affected.
11.5 Flow Simulation-Based Approaches 263
A different approach from the traditional density-based graph clustering methods is taken in flow simulation-based methods. The goal in this case is to predict the regions of the graph where flow will gather. The analogy is a water distribution network, with nodes representing storage tanks and edges the pipes between them. If we pump water into such a network, flow will gather at nodes that have many pipes ending in them, and hence in clusters. An effective way of simulating flow in a network is to use random walks, as described in the next section.
This operation is equivalent to M = AD^{-1}, where D is the diagonal degree matrix of the graph G. The MCL algorithm inputs the matrix M and performs two iterative operations on M, called expansion and inflation, and an additional pruning step as follows.
• Expansion: This operation simply takes the eth power of M:

M_exp = M^e,    (11.11)
e being a small integer, usually 2. Based on the properties of M, M_exp shows the distribution of a random walk of length e from each vertex.
• Inflation: In this step, the rth power of each entry of M is computed, and this value is normalized by dividing it by the sum of the rth powers of the column values:

M_inf(i, j) = M(i, j)^r / Σ_{k=1}^{n} M(k, j)^r    (11.12)
The idea here is to emphasize the flow where it is large and to decrease it where it
is small. This property makes this algorithm suitable for scale-free networks such
as PPI networks, as these networks have few high-degree hubs and many low-
degree nodes. As clusters are formed around these hubs, emphasizing them and
deemphasizing the low-degree nodes removes extra processing around the sparse
regions of the graph.
• Pruning: The entries which have significantly smaller values than the rest of the
entries in that column are removed. This step reduces the number of nonzero
column entries so that memory space requirements are decreased.
Algorithm 11.6 displays the pseudocode for the MCL algorithm [42].
After a number of iterations, there will be only one nonzero element in each column of M, and the nodes that have flows to this node are interpreted as a single cluster. The granularity of the clustering can be controlled by the parameter r, with lower r values resulting in a coarser clustering. The time complexity of this algorithm is O(n^3) steps, since the multiplication of two n × n matrices takes O(n^3) time during expansion, and the inflation can be performed in O(n^2) steps. The convergence of this algorithm has only been shown experimentally, with the number of rounds required to converge between 10 and 100 [3]. The MCL algorithm has been successfully applied to biological networks in various studies [8,44]; however, the scalability of MCL, especially in the expansion step, was questioned in [42]. MCL was also found to discover too many clusters in the same study, and a modification of MCL by a multilevel algorithm was proposed.
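As a concrete illustration of the iteration just described, the following is a minimal numpy sketch of sequential MCL with e = r = 2; it assumes a nonnegative adjacency matrix A (commonly augmented with self-loops) with no zero columns, and omits the pruning step for brevity.

```python
import numpy as np

def mcl(A, e=2, r=2, eps=1e-6, max_iter=100):
    A = np.asarray(A, dtype=float)
    M = A / A.sum(axis=0)                     # M = A D^-1, column stochastic
    for _ in range(max_iter):
        M_exp = np.linalg.matrix_power(M, e)  # expansion
        M_inf = M_exp ** r                    # inflation: entrywise power...
        M_inf = M_inf / M_inf.sum(axis=0)     # ...then column normalization
        if np.abs(M_inf - M).max() < eps:     # converged
            break
        M = M_inf
    # rows with nonzero entries indicate the attractors of the clusters
    return M_inf
```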
The MCL algorithm has two time-consuming steps, expansion and inflation, as described above. These are matrix operations that can be performed independently on a distributed memory computing system, whether a multiprocessor or fully autonomous nodes connected by a network. The expansion is basically a matrix multiplication, for which many parallel algorithms exist. The inflation operates independently on each column and hence can also be performed in a distributed system.
We will now sketch a distributed algorithm to perform MCL in parallel using m distributed memory processes p0, . . . , pm−1. The supervisor process p0 controls the overall flow of the algorithm and the m − 1 worker processes. The supervisor initializes the matrix M and broadcasts it to the m − 1 nodes, which all perform the multiplication of M by itself using row-wise 1-D partitioning and send the partial results back to the supervisor. The supervisor now builds the M^2 matrix, which can be partitioned again and sent to the workers, which multiply their part with their existing copy of M. This process is repeated to complete the expansion of the first iteration; for e = 2, finding M^2 is sufficient. The supervisor can now send the expanded matrix M_exp, column partitioned, to the m − 1 processes, each of which simply takes the rth power of each entry in its columns, normalizes the columns, and sends the resulting columns back to the supervisor. The task of the supervisor now is to check whether M has converged; if not, a new iteration is started with M_inf as the new seed matrix. Algorithm 11.7 shows the pseudocode for the distributed MCL algorithm, which can easily be implemented using MPI.
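Since the pseudocode of Algorithm 11.7 is not reproduced here, the sketch below shows one expansion-inflation round of the scheme in mpi4py form, under simplifying assumptions of ours: every rank takes part in the computation (rank 0 doubling as the supervisor), e = 2 is fixed, and convergence checking is left to the caller. Worker ranks may pass None for M.

```python
import numpy as np
from mpi4py import MPI

def distributed_mcl_round(M, comm, r=2):
    rank, size = comm.Get_rank(), comm.Get_size()
    M = comm.bcast(M, root=0)                  # supervisor broadcasts M
    n = M.shape[0]
    my_rows = np.array_split(np.arange(n), size)[rank]
    part = M[my_rows] @ M                      # expansion on my row block
    blocks = comm.gather(part, root=0)         # supervisor assembles M^2
    M_exp = np.vstack(blocks) if rank == 0 else None
    M_exp = comm.bcast(M_exp, root=0)          # redistribute for inflation
    my_cols = np.array_split(np.arange(n), size)[rank]
    infl = M_exp[:, my_cols] ** r              # inflation on my column block
    infl = infl / infl.sum(axis=0)
    cblocks = comm.gather(infl, root=0)        # supervisor rebuilds M_inf
    return np.hstack(cblocks) if rank == 0 else None
```

The supervisor then tests convergence on the returned matrix and starts the next round if needed.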
Fig. 11.8 A sample graph of eight vertices used to illustrate the distributed MCL algorithm
We will show the operation of the distributed MCL algorithm using the simple example graph of Fig. 11.8, which will also illustrate the detailed operation of the sequential algorithm. The matrix M for this graph, row partitioned by the supervisor for parallel processing, is:
    ⎡ 0    0.33  0  0  0  0     0     0.33 ⎤  p0
    ⎢ 0.5  0     0  0  0  0     0.33  0.33 ⎥
    ⎢ 0    0     0  0  0  0.25  0     0    ⎥  p1
M = ⎢ 0    0     0  0  0  0.25  0     0    ⎥
    ⎢ 0    0     0  0  0  0.25  0     0    ⎥  p2
    ⎢ 0    0     1  1  1  0     0.33  0    ⎥
    ⎢ 0    0.33  0  0  0  0.25  0     0.33 ⎥  p3
    ⎣ 0.5  0.33  0  0  0  0     0.33  0    ⎦
Assuming we have four processes p0, p1, p2, and p3, with p0 as the supervisor, the row partitioning of M results in rows 2 and 3 being sent to p1, rows 4 and 5 to p2, and rows 6 and 7 to p3. When the partial products are returned to p0, it forms M_exp shown below:
           p0              p1             p2              p3
       ⎡ 0.33   0.109 | 0     0    | 0     0.083 | 0.012  0.109 ⎤
       ⎢ 0.165  0.383 | 0     0    | 0     0     | 0.083  0.274 ⎥
       ⎢ 0      0     | 0.25  0.25 | 0.25  0     | 0.083  0     ⎥
M_exp =⎢ 0      0     | 0.25  0.25 | 0.25  0     | 0.083  0     ⎥
       ⎢ 0      0     | 0.25  0.25 | 0.25  0     | 0.083  0     ⎥
       ⎢ 0      0.109 | 0     0    | 0     0.159 | 0      0.109 ⎥
       ⎢ 0.33   0.165 | 0.25  0.25 | 0.25  0     | 0.301  0.109 ⎥
       ⎣ 0.165  0.274 | 0     0    | 0     0.083 | 0.109  0.383 ⎦
We then column partition M_exp and distribute it to processes p1, p2, and p3, which receive columns 2 and 3; 4 and 5; and 6 and 7, respectively. After they perform the inflation operation on their columns, they return the inflated columns to p0, which forms the M_inf matrix shown below. Although some of the entries have started to diminish, convergence has not been detected yet, and p0 will continue with the next iteration.
       ⎡ 0.401  0.044  0     0     0     0.178  0.001  0.047 ⎤
       ⎢ 0.099  0.538  0     0     0     0      0.053  0.291 ⎥
       ⎢ 0      0      0.25  0.25  0.25  0      0.053  0     ⎥
M_inf =⎢ 0      0      0.25  0.25  0.25  0      0.053  0     ⎥
       ⎢ 0      0      0.25  0.25  0.25  0      0.053  0     ⎥
       ⎢ 0      0.044  0     0     0     0.643  0      0.047 ⎥
       ⎢ 0.401  0.099  0.25  0.25  0.25  0      0.695  0.047 ⎥
       ⎣ 0.099  0.275  0     0     0     0.178  0.092  0.570 ⎦
As one of the few studies to provide a parallel/distributed MCL algorithm, Bustamam et al. implemented it using MPI, with results showing improved performance [11]. They also provided another parallel version of MCL, this time using many-core graphics processing units (GPUs) [12].
11.6 Spectral Clustering

Spectral clustering refers to a class of algorithms that use the algebraic properties of the graph representing a network. We have noted that the Laplacian matrix of a graph G is L = D − A in unnormalized form, with D being the diagonal degree matrix, which has the degree di of vertex i on its diagonal, and A the adjacency matrix. In normalized form, the Laplacian matrix is L = I − D^{-1/2} A D^{-1/2}. The matrix L has interesting properties that can be analyzed to find connectivity information about the graph G. First of all, the eigenvalues of L are real, as L is real and symmetric. The second smallest eigenvalue is called the Fiedler value, and the corresponding eigenvector, the Fiedler vector [21], provides connectivity information about the graph G. Using the Fiedler vector, we can partition a graph G into two balanced partitions by spectral bisection as follows [19]. We first construct the Fiedler vector and then compare each entry of this vector with a value s: if the entry F[i] ≤ s, the corresponding vertex vi of G is put in partition 1; otherwise it is placed in the second partition, as shown in Algorithm 11.8. The value s could simply be 0 or the median of the Fiedler vector. Figure 11.9 displays a simple graph that is partitioned using the value 0.
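A minimal numpy sketch of this bisection, following the description of Algorithm 11.8 (our own code, not the book's listing):

```python
import numpy as np

def spectral_bisection(A, s=0.0):
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A        # unnormalized Laplacian L = D - A
    w, V = np.linalg.eigh(L)              # eigenvalues in ascending order
    fiedler = V[:, 1]                     # eigenvector of the Fiedler value
    part1 = np.where(fiedler <= s)[0]     # F[i] <= s goes to partition 1
    part2 = np.where(fiedler > s)[0]
    return part1, part2
```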
Fig. 11.9 Partitions formed using the Fiedler vector. The first three elements of the Fiedler vector have values smaller than or equal to zero and are put in the first partition; the rest are placed in the second
Newman also proposed a method based on the spectral properties of the modularity matrix Q [37]. In this method, the eigenvector corresponding to the most positive eigenvalue of the modularity matrix is first found, and the network is divided into two groups according to the signs of the elements of this vector. Spectral bisection provides two partitions and can be used to find a k-way partition of a graph when executed recursively. Spectral clustering, however, is more general than spectral bisection and finds the clusters in a graph directly. This method is mainly designed to cluster n data points x1, . . . , xn but can be adapted to graphs, as it builds a similarity matrix S whose entries sij show how similar two data points xi and xj are. The spectral properties of this matrix are then investigated to find clusters of data points; the normalized Laplacian matrix L = I − D^{-1/2} S D^{-1/2} can be constructed for this purpose. A spectral clustering algorithm consists of the following steps [14]:
1. Construct similarity matrix S for n data points.
2. Compute the normalized Laplacian matrix of S.
3. Compute the first k eigenvectors of L and form the matrix V with columns as
these eigenvectors.
4. Compute the normalized matrix M of V .
5. Use k-means algorithm to cluster n rows of M into k partitions.
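The five steps above translate almost line by line into numpy, as in the following sketch; it assumes a precomputed n × n similarity matrix S with positive row sums and borrows scipy's kmeans2 for the final step, so it is an illustration under those assumptions rather than the implementation of [14].

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(S, k):
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_isqrt @ S @ D_isqrt        # step 2
    w, V = np.linalg.eigh(L)                          # ascending eigenvalues
    X = V[:, :k]                                      # step 3: first k eigenvectors
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # step 4: row-normalize
    _, labels = kmeans2(X, k, minit='++')             # step 5: cluster the rows
    return labels
```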
The k-means algorithm is a widely used method to cluster data points, as we reviewed in Sect. 7.3. The initial centers c1, . . . , ck can be chosen at random, the distances of the data points to these centers are calculated, and each data point is assigned to the cluster to which it is closest. The spectral clustering algorithm described requires significant computational power and memory space for the matrix operations, and also to run the k-means algorithm, due to the sizes of the matrices involved. A simple parallelization would involve computing the similarity matrix S using row partitioning. Finding the eigenvectors can also be parallelized using parallel eigensolvers, and the final k-means step can be parallelized as well [30]. A distributed algorithm based on the described parallel operations was proposed by Chen et al., who experimented with this algorithm on two large data sets using MPI and concluded that it is scalable and provides significant speedups [14].
11.7 Chapter Notes
Graph clustering is a well-studied topic in computer science, and there are numerous algorithms for this purpose. Our focus in this chapter was the classification and review of fundamental sequential and distributed clustering algorithms for biological networks. We have also proposed two new distributed algorithms, which can be implemented conveniently using a distributed programming environment such as MPI.
Two types of hierarchical clustering algorithms, MST-based and edge betweenness-based, have found more applications in the clustering of biological networks than other algorithms. Finding the MST of a graph using Boruvka's algorithm can be parallelized conveniently due to its structure. A different approach is taken in the CLUMP algorithm, where the graph is partitioned into a number of subgraphs and bipartite graphs are formed. The MSTs of the partitions and of the bipartite graphs are computed in parallel and then merged to find the MST of the whole graph. The edge betweenness algorithm removes the edge with the highest betweenness value from the graph at each iteration to divide it into clusters. The algorithm proposed by Yang and Lonardi distributes the graph over the processors, which find pair dependencies by running BFS in parallel on their local partitions.
Density-based clustering algorithms aim to discover dense regions of a graph as
these areas are potential clusters. Cliques are one example of such dense regions,
however, clique-like structures such as k-cliques, k-cores, k-plexes, and k-clubs are
more frequently found in biological networks than full cliques due to the erroneous
measurements and dynamicity in such environments. Out of these structures, only k-
cores of a graph can be found in linear time and therefore our focus was on sequential
and distributed algorithms for the k-core decomposition of graphs. The MCODE algorithm, which uses a combination of the clustering coefficient parameter and the k-core concept, has been successfully used in various PPI networks. There is no reported parallel/distributed version of the MCODE algorithm, and this may be a potential research area, as there are possibilities for independent operations such as finding the weights of vertices. The k-core based algorithms are widely used to discover clusters in complex networks such as biological networks and the Internet, and we reviewed a distributed k-core algorithm. The modularity concept provides a direct evaluation of the quality of the clusters obtained and has formed the basis of various clustering algorithms used in social and biological networks. We proposed a simple distributed modularity-based algorithm that can be used for any complex network, including biological networks.
Flow-based algorithms consider the flows in a network and assume that flows gather in clusters. An example algorithm with favorable performance that has been tested on PPI networks is the MCL algorithm, which we reviewed in its sequential form and for which we proposed a distributed version that can be implemented easily. Lastly, we described spectral bisection and spectral clustering based on the Laplacian matrix of a graph and showed ways to implement distributed spectral clustering.
In summary, we can state that most of these algorithms perform well in biological networks. However, each has merits and demerits, and the complexities of a number of
them have only been determined experimentally. Distributed algorithms for this task are very rare, and this may be a potential research area for researchers in this field.
Fig. 11.10 The example graph for Exercise 1 with clusters C1, C2, C3, and C4
Exercises
1. Find the intra-cluster and inter-cluster densities of the graph of Fig. 11.10. Do
these values indicate good clustering? What can be done to improve this cluster-
ing?
2. Work out the modularity value in the example graph of Fig. 11.11 based on three
clusters C1 , C2 , and C3 and determine which merge operation is to be done to
improve modularity. We need to check combining each cluster pair and decide
on the pair that improves modularity by the largest amount.
3. Show the output of the BFS-based graph partitioning algorithm on the example
graph of Fig. 11.12 in the first iteration, with the root vertex s. Then, partition the
resulting graphs again to obtain four partitions and validate the partitions in terms
of the number of vertices in each partition and the size of the minimum edge cut
between them.
4. Work out the MST of the weighted graph of Fig. 11.13 using Boruvka's algorithm. In the second step, partition the graph between two processors p1 and p2 and provide a distributed algorithm in which p1 and p2 form the MSTs in parallel. Show also the operation of the distributed Boruvka algorithm on this sample graph.
5. Find the edge betweenness values for all edges in the graph of Fig. 11.14 and
partition this graph into 3 clusters using Newman’s edge betweenness algorithm.
6. For the example graph of Fig. 11.15, implement the Batagelj and Zaversnik algorithm to find the coreness values of all vertices. Show all iterations of this algorithm and compose the k-cores of this graph in the final step.
7. Work out two iterations of the Markov Clustering Algorithm (MCL) on the example graph of Fig. 11.16. Does a clustering structure begin to appear?
References
1. Altaf-Ul-Amin Md et al (2003) Prediction of protein functions based on k-cores of protein-protein interaction networks and amino acid sequences. Genome Inf 14:498–499
2. Alvarez-Hamelin JI, Dall'Asta L, Barrat A, Vespignani A (2006) How the k-core decomposition
helps in understanding the internet topology. In: ISMA workshop on the internet topology
3. Bader GD, Hogue CWV (2003) An automated method for finding molecular complexes in
large protein interaction networks. BMC Bioinform 4(2)
4. Batagelj V, Zaversnik M (2003) An O(m) algorithm for cores decomposition of networks.
CoRR (Computing Research Repository), arXiv:0310049
5. Boruvka O (1926) About a certain minimal problem. Práce mor. přírodověd. spol. v Brně III (in Czech, German summary) 3:37–58
6. Blaar H, Karnstedt M, Lange T, Winter R (2005) Possibilities to solve the clique problem by
thread parallelism using task pools. In: Proceedings of the 19th IEEE international parallel and
distributed processing symposium (IPDPS05)—Workshop 5—Volume 06 in Germany
7. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities
in large networks. J Stat Mech: Theory Exp (10):P10008
8. Brohee S, van Helden J (2006) Evaluation of clustering algorithms for protein-protein interac-
tion networks. BMC Bioinform 7:488. doi:10.1186/1471-2105-7-488
9. Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun
ACM 16:575–577
10. Buluc A, Madduri K (2011) Parallel breadth-first search on distributed memory systems. CoRR
arXiv:1104.4518
11. Bustamam A, Sehgal MS, Hamilton N, Wong S, Ragan MA, Burrage K (2009) An efficient
parallel implementation of Markov clustering algorithm for large-scale protein-protein inter-
action networks that uses MPI. In: Proceedings of the fifth IMT-GT international conference
mathematics, statistics, and their applications (ICMSA), pp 94–101
12. Bustamam A, Burrage K, Hamilton NA (2012) Fast parallel Markov clustering in bioinformatics
using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format.
IEEE/ACM Trans Comp Biol Bioinform 9(3):679–691
13. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The
MIT Press
14. Chen W-Y, Song Y, Bai H, Lin C-J, Chang EY (2011) Parallel spectral clustering in distributed
systems. IEEE Trans Pattern Anal Mach Intell 33(3):568–586
References 273
15. Cheng J, Ke Y, Chu S, Ozsu MT (2011) Efficient core decomposition in massive networks.
In: ICDE’11 proceedings of the 2011 IEEE 27th international conference data engineering, pp
51–62
16. Clauset A, Newman ME, Moore C (2004) Finding community structure in very large networks.
Phys Rev E 70(6):066111
17. Du Z, Lin F (2005) A novel approach for hierarchical clustering. Parallel Comput 31(5):523–
527
18. Dongen SV (2000) Graph clustering by flow simulation. PhD Thesis, University of Utrecht,
The Netherlands
19. Elsner U (1997) Graph partitioning, a survey. Technical report, Technische Universität Chemnitz
20. Erciyes K (2014) Complex networks: an algorithmic perspective. CRC Press, Taylor and Francis. ISBN-13: 978-1466571662, ISBN-10: 1466571667, Chap. 8
21. Fiedler M (1989) Laplacian of graphs and algebraic connectivity. Comb Graph Theory 25:57–
70
22. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3):75–174
23. Garey MR, Johnson DS (1978) Computers and intractability: a guide to the theory of NP-completeness. Freeman
24. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc
Natl Acad Sci USA 99:7821–7826
25. Gehweiler J, Meyerhenke H (2010) A distributed diffusive heuristic for clustering a virtual
P2P supercomputer. In: Proceedings of the 7th high-performance grid computing workshop
(HGCW10) in conjunction with 24th international parallel and distributed processing sympo-
sium (IPDPS10). IEEE Computer Society
26. Hartuv E, Shamir R (2000) A clustering algorithm based on graph connectivity. Inf Process
Lett 76(4):175–181
27. Jaber K, Rashid NA, Abdullah R (2009) The parallel maximal cliques algorithm for protein
sequence clustering. Am J Appl Sci 6:1368–1372
28. Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 2:241–254
29. Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst
Tech J 49(2):291–307
30. Kraj P, Sharma A, Garge N, Podolsky R, Richard A, Mcindoe RA (2008) ParaKMeans: imple-
mentation of a parallelized K-means algorithm suitable for general laboratory use. BMC Bioin-
form 9:200
31. LaSalle D, Karypis G (2015) Multi-threaded modularity based graph clustering using the
multilevel paradigm. J Parallel Distrib Comput 76:66–80
32. Mohseni-Zadeh S, Brezelec P, Risler JL (2004) Cluster-C, an algorithm for the large-scale
clustering of protein sequences based on the extraction of maximal cliques. Comput Biol
Chem 28:211–218
33. Montresor A, Pellegrini FD, Miorandi D (2013) Distributed k-Core decomposition. IEEE Trans
Parallel Distrib Syst 24(2):288–300
34. Murtagh F (2002) Clustering in massive data sets. Handbook of massive data sets, pp 501–543
35. Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev
E 69:066133
36. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks.
Phys Rev E 69:026113
37. Newman MEJ (2006) Finding community structure in networks using the eigenvectors of
matrices. Phys Rev E 74:036104
38. Olman V, Mao F, Wu H, Xu Y (2009) Parallel clustering algorithm for large data sets with
applications in bioinformatics. IEEE/ACM Trans Comput Biol Bioinform 6:344–352
39. Riedy J, Bader DA, Meyerhenke H (2012) Scalable multi-threaded community detection in
social networks. In: 2012 IEEE 26th international parallel and distributed processing sympo-
sium workshops and PhD forum (IPDPSW), IEEE, pp 1619–1628
274 11 Cluster Discovery in Biological Networks
12.1 Introduction
Network motifs are the building blocks of various biological networks such as transcriptional regulation networks, protein–protein interaction networks, and metabolic networks. Discovering network motifs gives insight into system-level functions in such networks. Given a graph G that represents a biological network, a network motif m is a small, recurrent, and connected subgraph of G that is found with a greater frequency in G than would be expected in a random graph with a structure similar to G. A motif is assumed to have some biological significance and is believed to perform a specific function in the network; it is widely accepted that there is, in general, a relation between its structure and its function.
Searching for network motifs provides an analysis of the basic functions performed by a biological network and also helps to understand the evolutionary relationships between organisms. Conserved motifs in a PPI network allow protein–protein interaction prediction [1], and a motif conserved in two or more organisms may have similar functions in them. For instance, identical motifs have been found in the transcriptional interaction networks of E. coli and the yeast S. cerevisiae, which may mean that common functions are carried out by these motifs [14]. Discovery of network motifs is a computationally difficult task, as the number of possible motifs grows exponentially with the size of the motif, which is expressed by the number of vertices contained in it. The motif discovery process consists of three basic steps: discovery of frequently appearing subgraphs of the target graph representing the biological network; identification of topologically equivalent subgraphs of a given size, which is the graph isomorphism problem, a problem in NP but not known to be NP-complete (the subgraph isomorphism problem, however, is NP-complete [7]); and, as the last step, construction of a set of random graphs with a similar topology in which the motifs are searched to determine the statistical significance of the motifs found. The statistical testing involves computing the p-value and z-score of the motif. There are various algorithms for this purpose, which can be broadly classified as exact counting methods, in which every occurrence of a subgraph is discovered in the target
graph, and sampling-based methods where a sample of the graph is searched. Exact
methods are time-consuming and are limited in their motif size, whereas sampling-
based methods provide approximate results with lower time complexities. Parallel
and distributed algorithms for motif discovery are very scarce as we will see.
It should be noted that the network motif concept has attracted some criticism. It was argued in [2] that concepts such as preferential attachment may lead to the appearance of motifs that have no attributed functionality; this was answered by Milo et al. [19], stating that the discovery of motifs should consider subgraph significance profiles as well as the frequency of the subgraph patterns, as was also noted in [34].
In this chapter, we first define the motif discovery problem formally, classify
algorithms for this purpose, and state parameters for measuring the significance of
motif occurrence. We then briefly review and compare the main sequential motif
discovery algorithms. Our emphasis is later on parallel and distributed algorithms
for motif discovery where we review the few existing approaches and propose general
guidelines for potential distributed algorithms.
Fig. 12.2 Directed motifs found in biological networks: a feed-forward loop, b bifan, and c multi-input motifs
Certain motifs are found in abundance in some biological networks. For example, the feed-forward loop (FFL) and bifan motifs shown in Fig. 12.2 were discovered in transcriptional regulatory networks and neuronal connectivity networks. The FFL performs a basic function by controlling connections and signaling.
There are three subtasks involved in finding a motif of size k (mk) in a target graph G, as follows.
1. Finding all instances of mk in G: This step can be done by exact counting methods or sampling-based methods. The former requires the enumeration of all subgraphs, which has a high time complexity. Sampling-based methods run the algorithm on a sampled subgraph of the target graph and find approximate solutions in much less time than exact methods.
2. Determination of the isomorphic classes of the discovered subgraphs: Some of the found subgraphs may be equivalent, which requires grouping them into isomorphism classes.
3. Determination of the statistical significance of the discovered subgraphs: A set of random graphs with structures similar to the target graph needs to be generated, and the steps above should be performed on this set. Statistical comparison of the results in the target graph and in these graphs will then show whether the discovered subgraphs in the target graph are actual motifs.
Fig. 12.3 a Two isomorphic graphs where vertex x is mapped to vertex x′ for all vertices. b The small graph is isomorphic to the subgraph of the larger graph shown in bold; the subgraph may not be induced, as in this example
that certain subgraphs of this size are motifs in G1. We can then search for these motifs in another graph G2, which represents a network N2 similar to N1, and may conclude that these networks have similar subgraphs, may have evolved from the same ancestors, and may have similar functionality.
Informally, two isomorphic graphs are the same graph drawn differently. Graph isomorphism can be defined as follows.
Definition 12.1 (Graph Isomorphism) Two graphs G1(V1, E1) and G2(V2, E2) are isomorphic if there is a one-to-one and onto function f : V1 → V2 such that (u, v) ∈ E1 ⇔ (f(u), f(v)) ∈ E2.
If any two vertices u and v of the graph G1 with an edge between them are mapped to vertices w and x of G2, there should be an edge between w and x. Given two graphs G1(V1, E1) and G2(V2, E2), with G1 being a smaller graph than G2, the subgraph isomorphism problem is to search for a subgraph G3 of G2 of maximal size that is isomorphic to G1. Figure 12.3 displays these concepts. The decision version of this problem is to determine whether G1 is isomorphic to any subgraph of G2. This problem can be shown to be NP-complete by reduction from the clique problem, in which, given an input graph G and an integer k, we check whether G contains a clique with k vertices.
Two graphs G1 and G2 with corresponding adjacency matrices A1 and A2 are isomorphic if there exists a permutation matrix P such that A2 = P A1 P^T. A permutation matrix is obtained by permuting the rows and columns of the identity matrix
I of the same size, and therefore has exactly a single 1 in each row and column.
Fig. 12.4 Motif frequency concepts. a The feed-forward loop and b a sample graph. The motifs found using F1 are {(a, b, c), (a, e, f), (a, e, c), (b, e, c), (g, a, e)}; one of the possible sets using F2 is {(a, b, c), (a, e, f)}; and the motif found using F3 is any one of the motifs of F1
Ullmann's algorithm was an early effort to find subgraph or graph isomorphism between two graphs using the permutation matrix [31]. It progressively generates a permutation matrix of the size of the smaller graph, and the equation above is checked to detect isomorphism. The nauty algorithm proposed by McKay [15,20] also finds subgraph isomorphism, using vertex invariants, such as k-cliques and the number of vertices at certain distances from a vertex, together with group theory; it is used in network motif search algorithms such as FANMOD [33].
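For very small graphs, this characterization can be checked directly by enumerating permutation matrices, as in the brute-force sketch below (our own illustration; practical tools such as nauty use far more refined techniques):

```python
import numpy as np
from itertools import permutations

def isomorphic(A1, A2):
    # Search for a permutation matrix P with A2 = P A1 P^T.
    A1, A2 = np.asarray(A1), np.asarray(A2)
    if A1.shape != A2.shape:
        return False
    n = A1.shape[0]
    I = np.eye(n, dtype=int)
    for perm in permutations(range(n)):
        P = I[list(perm)]                 # a permutation matrix
        if np.array_equal(A2, P @ A1 @ P.T):
            return True
    return False
```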
While searching for all subgraphs of a given size in the target network, we need to classify the found subgraphs into isomorphism classes, which is the graph isomorphism problem in NP. If we know which motif to search for in the target network, the problem is slightly different: we then deal with the subgraph isomorphism problem, which is NP-complete in general.
Fig. 12.5 Random graph generation using the Monte Carlo method. a The original graph. b The random graph generated using edges (h, g) and (c, d); these edges are deleted and the newly formed edges are (h, d) and (c, g)
The frequencies of a motif m should be computed both in the target graph G and in a set of k random graphs R = R1, . . . , Rk. In many motif searching methods, random graphs are typically generated by a Monte Carlo algorithm, sometimes referred to as the switching method. At each iteration of this algorithm, two random edges (a, b) and (c, d) are selected and replaced by the two new edges (a, d) and (c, b), preserving the in-degrees and out-degrees of the vertices. If multiple edges or loops would be created by this modification, the step is discarded. This process is repeated sufficiently many times to obtain a random graph that is quite different from G but has the same topological properties, such as the number of vertices and edges and the degree sequence. The exchange of two edges using this method is shown in Fig. 12.5. The members of the set R are formed using this algorithm, and motif searching can then be performed on the graphs of this set to determine the statistical significance of the subgraphs found.
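A minimal sketch of the switching method for a directed edge list follows (our own code): each accepted switch replaces (a, b) and (c, d) with (a, d) and (c, b), and switches that would create loops or multiple edges are discarded, as described above.

```python
import random

def switch_edges(edges, n_switches, seed=None):
    rng = random.Random(seed)
    eset = set(edges)
    done = 0
    while done < n_switches:
        (a, b), (c, d) = rng.sample(list(eset), 2)
        if a == d or c == b:                   # would create a self-loop
            continue
        if (a, d) in eset or (c, b) in eset:   # would create a multi-edge
            continue
        eset -= {(a, b), (c, d)}               # swap the edge endpoints
        eset |= {(a, d), (c, b)}
        done += 1
    return list(eset)
```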
• P-value: A motif m is typically considered statistically significant if P(m) < 0.01. This parameter is determined by finding the number of elements of R that contain m more frequently than G does. This number is divided by the size of the set R as follows:

P(m) = (1/n) Σ_{i=1}^{n} σ_{R_i}(m)    (12.1)

in which σ_{R_i}(m) is set to 1 if the occurrence of motif m in the random network Ri ∈ R is higher than in G, and to 0 otherwise.
• Z-score: If the motif m occurs Fm times in the network G, and F̄r and σr² are the mean and variance of the frequency of m in a set of random networks, the Z-score of m, Z(m), is the ratio of the difference between the frequency Fm of m in the target network and its average frequency F̄r in the randomized networks to the square root of the variance σr², as shown below. A Z(m) value greater than 2.0 means the motif is significant [11].

Z(m) = (F_m − F̄_r) / σ_r    (12.2)
• Motif significance profile: The motif significance profile (SP) is a vector whose elements are the Z-scores of a set of motifs m1, m2, . . . , mn, normalized to unit length as shown in Eq. 12.3. It is used to compare various networks considering the motifs that exist in them.

SP(m_i) = Z(m_i) / ( Σ_{k=1}^{n} Z(m_k)^2 )^{1/2}    (12.3)
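Once the frequencies are available, the measures above are straightforward to compute. The helper below (ours, with hypothetical argument names) returns the Z-scores of Eq. 12.2 and the significance profile of Eq. 12.3, using the population variance of the random-network frequencies.

```python
import math

def significance(f_target, f_random):
    # f_target: motif -> frequency in G
    # f_random: motif -> list of frequencies in the random networks
    z = {}
    for m, fm in f_target.items():
        freqs = f_random[m]
        mean = sum(freqs) / len(freqs)
        var = sum((x - mean) ** 2 for x in freqs) / len(freqs)
        z[m] = (fm - mean) / math.sqrt(var) if var > 0 else 0.0
    norm = math.sqrt(sum(v * v for v in z.values()))
    sp = {m: (v / norm if norm > 0 else 0.0) for m, v in z.items()}
    return z, sp
```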
The type of an algorithm depends on whether it is network centric or motif centric, and whether it is exact census or sampling based, as shown in the classification of Fig. 12.6. Since the number of subgraphs grows exponentially both with the size of the network and with the size of the subgraph investigated, the exact census algorithms demand high computational power, and their performance degrades with increasing motif size. An efficient way to overcome this problem is to employ probabilistic algorithms, where a number of subgraphs of the required size are sampled in the target graph and in the randomly generated graphs, and the algorithms are executed on these samples rather than on the entire graphs. The accuracy of the method employed increases with the number of samples used; however, we need to ensure that these subgraphs are qualitative representatives of the original graphs so that the results have reasonable certainty. Mfinder and ESU are exact network centric algorithms with sampling-based versions, whereas Kavosh has only an exact census version. Grochow–Kellis
and MODA algorithms are motif centric, with MODA also having a sampling-based version. We will briefly review all of these algorithms in the next sections.
Fig. 12.6 Classification of network motif search algorithms
Network centric motif search evaluates all possible subgraphs in the target graph and groups them into isomorphism classes. Isomorphism testing can be done using different methods, but the nauty algorithm [15,20] is commonly used in various applications. The Mfinder, ESU, and Kavosh algorithms all implement network centric methods, as described next.
each vertex. This function performs recursive calls, adding neighboring vertices to extend Vsub until the required motif size k is reached.
When a new node is selected for expansion, it is removed from the possible extensions, and its exclusive neighbors are added to the new possible extensions. This way, the algorithm ensures that each subgraph is enumerated exactly once, since the nonexclusive nodes will be considered in the execution of another recursion. Figure 12.7 displays the execution of this algorithm on a sample graph of five nodes labeled 1, . . . , 5; the output consists of four triads and a triangle as shown. The FANMOD tool, which incorporates the ESU algorithm, uses the nauty algorithm to calculate graph isomorphisms [33].
Algorithm 12.2 ESU
1: Input: G(V, E), int 1 ≤ k ≤ n
2: Output: all k-size subgraphs of G
3: for all v ∈ V do
4:   Vext ← {u ∈ N(v) : label(u) > label(v)}
5:   ExtSub({v}, Vext, v)
6: end for
7: return
8: procedure ExtSub(Vsub, Vext, v)
9:   if |Vsub| = k then output G[Vsub]
10:    return
11:  end if
12:  while Vext ≠ Ø do
13:    remove a random vertex w from Vext
14:    V′ext ← Vext ∪ {u ∈ Nexcl(w, Vsub) : label(u) > label(v)}
15:    ExtSub(Vsub ∪ {w}, V′ext, v)
16:  end while
17:  return
18: end procedure
Fig. 12.7 Execution of the ESU algorithm on a sample graph: the ESU tree and the enumerated subgraphs
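Algorithm 12.2 maps naturally onto a recursive Python function. The sketch below is our own rendering, with comparable integer vertex labels playing the role of label(·); it enumerates each connected k-vertex subgraph exactly once.

```python
def esu(adj, k):
    # adj: vertex -> set of neighbors; labels are comparable integers
    found = []

    def ext_sub(v_sub, v_ext, v):
        if len(v_sub) == k:
            found.append(frozenset(v_sub))
            return
        v_ext = set(v_ext)
        while v_ext:
            w = v_ext.pop()
            # exclusive neighbors of w: not in v_sub, not adjacent to it
            excl = {u for u in adj[w] if u > v and u not in v_sub
                    and all(u not in adj[x] for x in v_sub)}
            ext_sub(v_sub | {w}, v_ext | excl, v)

    for v in adj:
        ext_sub({v}, {u for u in adj[v] if u > v}, v)
    return found
```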
The author implemented RAND-ESU and sampling-based mfinder (ESA) in C++ with the instances COLI (transcriptional network of E. coli [30]), YEAST (transcriptional network of S. cerevisiae [18]), ELEGANS (neuronal network of Caenorhabditis elegans [12]), and YTHAN (food web of the Ythan estuary [36]). It was found that RAND-ESU is much faster than sampling-based mfinder, reaching several orders of magnitude better performance for subgraph sizes k ≥ 5. Two versions of RAND-ESU, called fine-grain and coarse-grain based on the values of the sampling probabilities pd, were used in the tests. Fine-grain RAND-ESU provided consistent sampling quality, whereas ESA had very good sampling quality only in some ranges. Other reported advantages of RAND-ESU are that it is unbiased and that it can estimate the total number of subgraphs.
of motif frequencies. The authors tested the method on the metabolic pathway of the bacterium E. coli and the transcription network of the yeast S. cerevisiae, a real social network, and an electronic network, and compared their results to FANMOD. They concluded that Kavosh has lower memory usage and is faster than FANMOD for motif sizes 6, 7, and 8.
Motif centric algorithms typically take as input a single motif, or sometimes a set of motifs, of a given size k and search for these motifs in the input network rather than enumerating all subgraphs of size k. They are commonly used in situations where the motif to be searched for is known beforehand. Two representative methods of this class are the MODA and Grochow–Kellis algorithms, described next.
The last two items are also applicable to network centric methods. The main idea of the algorithm is to progressively map the desired query onto the target graph. Grochow and Kellis proposed a novel symmetry-breaking technique to avoid repeated isomorphism tests. They also consider the degrees of nodes and the degrees of their neighbors to speed up isomorphism tests using subgraph hashing, where subgraphs of a given size are hashed based on their degree sequences. Algorithm 12.3 shows the pseudocode for this algorithm, adapted from [8], where finding all of the instances of the query graph M in the target graph G is described.
The first step of the algorithm is calculating the equivalence classes ME of the query graph in order to map only one representative of each class. Then, symmetry-breaking conditions C are calculated to avoid counting the same subgraph multiple times. After this process, the algorithm tries to match every vertex g of the graph G to one of the vertices of M. If the algorithm finds a partial map f from M to G, the IsoExt function is called to find all isomorphic extensions of f satisfying the conditions C. This function recursively finds the most constrained neighbor of g and adds it to the mapping if it does not violate the symmetry conditions and has an appropriate degree sequence, until the whole query graph is mapped. The GK algorithm can be used with the gtools package [15] to generate all possible subgraphs of a given size and count the frequency of each subgraph.
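The subgraph hashing idea can be illustrated in a few lines: grouping candidate subgraphs by a cheap invariant such as the internal degree sequence means that full isomorphism tests are only needed within a bucket. The sketch below is ours, not the authors' code.

```python
from collections import defaultdict

def hash_by_degree_sequence(subgraphs, adj):
    # subgraphs: iterable of vertex sets; adj: vertex -> neighbor set
    buckets = defaultdict(list)
    for sg in subgraphs:
        degs = tuple(sorted(sum(1 for u in adj[v] if u in sg) for v in sg))
        buckets[degs].append(sg)          # same key: possible isomorphs
    return buckets
```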
The authors applied this algorithm to a PPI network [9] and a transcription network of S. cerevisiae [5], and compared its performance to other methods of motif discovery. They showed that the GK algorithm provides an exponential time improvement compared with mfinder when subgraphs up to size 7 are counted. Similarly, memory usage was exponentially smaller than that of mfinder, as the memory requirement of the algorithm is proportional to the size of the query graph, whereas the compared algorithm requires space proportional to the number of subgraphs of a given size. The main advantage of the GK algorithm is reported to be its symmetry-breaking method, which ensures that each subgraph instance is discovered exactly once. Using this method, the authors report finding significant structures of 15 and 20 nodes in the PPI network and the regulatory network of S. cerevisiae, and also the rediscovery of the cellular transcription machinery, as a 29-node cluster of 15-node motifs, based solely on the structure of the protein interaction network.
12.3.2.2 MODA
MODA is another motif centric algorithm, proposed by Omidi et al. [21]. It uses the pattern growth method, which is the general term for extending subgraphs until the desired size is reached. At each step of the algorithm, the frequency of the query graph is searched for in the target network. The algorithm makes use of information about previous subgraph query searches: if the current query is a supergraph of a previous query, the previous mapping can be reused. A hierarchical structure is needed to implement the subgraph and supergraph relationships between the queries. The expansion tree Tk, introduced for this purpose, has query graphs as its nodes, and as the depth of Tk increases, the query graphs become more complete. The expansion tree Tk for size k has the following properties [21].
Figure 12.8 displays the expansion tree T4 for 4-node graphs. An edge is added at
each level to form a child graph. All of the graphs at each level are nonisomorphic
to prevent redundancy. The depth of the tree Tk is the depth of the node with the
complete graph of k nodes (K k ).
The subgraph frequency calculation uses the expansion tree Tk in a bottom-up direction and employs two methods, called mapping and enumerating. A BFS traversal is performed on Tk to fetch the graphs stored at its nodes. First, the trees at the first level of Tk are fetched, and all mappings from these trees to the network are formed by the mapping module and saved for future reference. In the second step, the query graphs at the second level are fetched and processed by an enumerating module to compute the appearance numbers of these graphs. This process continues until the complete graph Kk is reached. The subgraph frequencies computed using the mapping and enumeration modules are checked against a threshold Δ, and any exceeding values are added to the frequent subgraph list. The pseudocode of the MODA subgraph frequency finding algorithm is shown in Algorithm 12.4, adapted from [21].
Fig. 12.8 The expansion tree T4 for four-node query graphs
The value of m in the mapping module is a predefined value specifying the number of samples taken from the network. All possible mappings are computed between lines 18 and 24. All isomorphic mappings are found in line 21 by the branch-and-bound method, as in the GK algorithm. The enumeration module starts by finding the parent node of G and loading its parent set FH found in the previous step. The new edge added to G by its parent is determined in line 33, and the mapping supporting the selected criteria is decided in lines 34–38.
MODA with Sampling
Since the mapping module of MODA consumes a lot of time, a sampling version of MODA was proposed for better efficiency [21]. In this version, the time-consuming mapping module is modified so as to select only a sample of nodes from the network. The sampling should provide reliable results; that is, the number of mappings for the input query must not vary significantly across different runs of the algorithm. The number of subgraphs in which a node is included is proportional to its degree, as shown by experiments. Therefore, selecting a node with a probability related to its degree for sampling provides the required reliability. The sampling MODA based on this concept is shown in Algorithm 12.5 [23].
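Degree-proportional selection of seed vertices is a one-liner with the Python standard library; the sketch below (ours) shows only this sampling strategy, not the full MODA mapping module.

```python
import random

def degree_weighted_seeds(adj, n_samples, seed=None):
    rng = random.Random(seed)
    nodes = list(adj)
    weights = [len(adj[v]) for v in nodes]    # probability ~ degree
    return rng.choices(nodes, weights=weights, k=n_samples)
```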
MODA requires substantial memory space, as all k-subgraphs are stored in memory. Sampling is more efficient, at the cost of decreased accuracy. In [21], both the sampling and nonsampling versions of the algorithm were tested and compared with the GK algorithm, mfinder, FANMOD, and FPF [28]. The tests were carried out on E. coli transcription network data. Comparing MODA with sampling-based MODA shows that sampling MODA is faster and can detect motifs up to size 9. When all algorithms were tested in terms of the time taken for various subgraph sizes, MODA turned out to be the second fastest after FANMOD, and the maximum subgraph size reached was 9 with MODA, whereas all of the others reached subgraph size 8 at most.
12.4 Distributed Motif Discovery

Parallel and distributed algorithms for motif discovery are scarce. Before we review the few reported studies of distributed motif search, we will describe general ways of parallelizing the three steps of motif search. The main steps of motif search, and the possible parallel processing in these steps, are as follows.
1. Subgraph enumeration: For network centric motif search, we can partition the network graph and distribute it evenly to k processes p0, . . . , pk−1. The root process p0 can perform the partitioning, and for edges incident across the partitions, the ghost vertices concept, in which border vertices are duplicated in each partition, can be used. Each pi has information about these vertices, and using a simple labeling scheme, the processing of the boundary edges can be done, for example, by the lowest identifier process that has the ghost vertex.
Alternatively, we can distribute the whole network graph to all processes, and the root process sends a set of specific motifs of a given size to each process pi, which performs a motif centric search for the specific motifs it receives. It then reports the results to the root, which combines the outputs from all processes.
For the motif centric approach, we can simply partition and distribute the graph and have each process pi search for the specific motif in its partition. Similar strategies can be employed for sampling-based motif searches. For example, the whole graph can be distributed to all processes, and each process can perform its own sampling and motif searching.
2. Finding isomorphic graphs: This step can be performed in parallel as a continuation of the first step. In the case of the partitioned graph, each process can also find the isomorphism classes of the motifs it has discovered using an algorithm such as nauty. If the graph is sent as a whole to each process, the same isomorphism classification can be done on the motifs found.
3. Evaluating statistical significance: In this case, we first need to generate m random graphs R1, . . . , Rm using an algorithm such as the Markov chain edge switching method. We then need to find the subgraphs as in step 1 and compute their frequencies in the original graph and in these random graphs. In the simplest case, we can have each of the k processes generate m/k random graphs and perform the rest of the processing sequentially on the random graphs it owns. The results can then be sent to the root process, which combines all of them to produce the final output and report the statistical significance.
The reported studies of distributed motif discovery are due to Wang et al., Schatz et al., and Ribeiro et al., as described next.
Wang et al. presented one of the first studies of exact parallel motif discovery [32]. The main idea of this method is to define neighborhoods of nodes, limit the search space by searching for motifs in these neighborhoods, and perform these searches in parallel. The neighborhood of a node v, Nbr(v), consists of the nodes v′ such that d(v′, v) ≤ k − 1 for a given integer k. The first algorithm proposed constructs a BFS tree T(V′, E′) rooted at a node v of depth k − 1. The role of each node in Nbr(v) is also defined as
uncle, nephew, or hidden-uncle. The level of a node u in the BFS tree is level(u), and uncle(u) is a node v that has the same level as the parent of u with an edge between u and v, that is, (u, v) ∈ E, but is not the parent of u. The nephew of u is a node w that is the child of an uncle of u, as in family relationships. The uncles, nephews, and hidden-uncles are identified using the first algorithm.
The second algorithm adds further edges, called virtual edges, to the existing edges in Nbr(v). If two nodes u and v both have an edge to another node w, where level(u) = level(v) = level(w) + 1 and (u, v) ∉ E, a virtual edge is added between u and v. Virtual edges are also added between nodes and their hidden-uncles. The auxiliary neighborhood Aux_Nbr(v) of a node v is obtained this way.
The third algorithm searches for all k-connected subgraphs in Aux_Nbr(v). These three algorithms are used to find motifs in parallel in a number of processes p0, . . . , pk−1, with p0 as the supervisor and the rest of the processes as workers. The process p0 sorts the nodes according to their degrees and broadcasts this list to all workers in the initial step. It then partitions the graph and sends the partitions to the processes, each of which runs the three algorithms described for the nodes in its partition.
The results are gathered at the supervisor at the end of processing, which then checks isomorphism and decides on the motifs. This method has to be applied to a number of random graphs to determine the significance of the discovered motifs. The authors applied this procedure to data on interactions between transcription factors and operons in the E. coli network. The MPI library with a C programming interface was used on a 32-node cluster with two processors per node. The exhaustive search, random sampling, and the parallel algorithm were tested on the E. coli transcriptional network, and the running times and precision of all three
methods were recorded. The proposed parallel method outperforms the other two in time for motifs up to size 4 but is slower than the random sampling method for larger sizes. The precision of the parallel method is shown to be the best among all of the methods. When both parameters are considered, the parallel algorithm is reported to have the best overall performance.
Ribeiro et al. provided three methods for distributed motif discovery. The first algorithm makes use of a special data structure called a g-trie, the second one is an attempt to parallelize the ESU algorithm we have reviewed, and the third one is a method to parallelize all three steps of the motif discovery process, using the ESU algorithm as an example. We will first describe the g-trie structure, with a sequential algorithm based on this structure, and then review the distributed algorithms.
Fig. 12.9 A prefix trie storing a set of words; each path from the root corresponds to a prefix of a stored word
search the same graph for different queries, which may be avoided. The idea is to design an algorithm that is motif centric but, rather than using a single query motif, is specialized in finding a set of input motifs. Similar to the discovery of DNA/RNA and protein sequence motifs we reviewed in Chap. 8, the notion of searching for repeated words in a sequence can be used here by searching for repeated subgraph patterns. In a prefix trie, a path from the root to a leaf corresponds to a prefix of the input word, as shown in Fig. 12.9.
A prefix trie can be searched in linear time for the dictionary operation of finding a word. It is also efficient in storing data, as common prefixes are stored only once rather than once for each word. The authors applied these concepts to graphs by introducing the Graph reTRIEval (g-trie) data structure for motif search, which is basically a prefix trie that has graphs as nodes instead of letters. Two types of data are stored at each node of a g-trie: information about a single graph vertex and its edges to ancestor nodes, and a boolean variable showing whether it is the last vertex of a graph. A g-trie is formed by iteratively inserting subgraphs into it. Figure 12.10 shows the g-trie for subgraphs of up to size 4. Each path in a g-trie corresponds to a unique subgraph.
In the first step of motif discovery, the g-trie data structure is built for the set of subgraphs to be searched in the target network G. The next step involves finding the g-trie graphs in G and in the random networks; finally, the statistical significance of the obtained results should be determined. While searching for subgraphs in G, the isomorphism tests are carried out concurrently, so that additional tests are not needed. Two approaches are proposed in [23] for using g-tries in motif discovery:
Fig. 12.10 A g-trie storing all subgraphs of sizes 2 to 4; each path from the root corresponds to a unique subgraph
• G-trie use only: All possible subgraphs of a given size are generated, for example by the gtools package [16], and then inserted into a g-trie structure. Then, the g-trie matching algorithm is applied to match the nodes of the g-trie structure to the target network G. For the discovered subgraphs, a new g-trie is formed, and g-trie matching is applied to the random networks with this new g-trie.
• Hybrid approach: The g-trie structure is formed only for the discovered subgraphs
in G and g-trie matching is applied to random networks with this structure. Enu-
merating subgraphs in G can be performed by another network-centric algorithm
such as ESU.
Parallel G-tries
Parallel ESU
Ribeiro et al. selected ESU for parallelization, as it is an efficient algorithm for subgraph enumeration [26]. The procedure ExtSub in this algorithm (Algorithm 12.2) recursively extends the subgraph, and the main idea of the parallelization is to distribute these procedure calls to a set of processors. Since each call to ExtSub is independent, it is treated as a primary work unit to be distributed to a processor. A load balancing strategy is needed, as the amount of work done by these work units varies greatly. After experimenting with several methods, the authors chose the supervisor-worker model, in which the workers ask the supervisor for work units, which it sends until all work units are finished. The supervisor merges all of the results and computes the frequencies of the subgraphs at the completion of the work units. The ordering of the work units on the supervisor's work list plays an important role in the overall performance of the system. The selected strategy, called largest processing time first (LPTF), was found to be effective. There is still a problem due to the power-law degree distribution of biological networks, which exhibit a few very high-degree nodes and many low-degree ones. Executing the recursive calls to find subgraphs around very high-degree nodes requires high computation times, and this source of imbalance has to be addressed. The authors devised a strategy to prevent this situation by giving higher labels to these hubs than to ordinary nodes with lower degrees. This results in fewer neighbors with higher labels than the originating node. The load balancing scheme incorporating the described features, along with a hierarchical data aggregation procedure, is an important component of the proposed method, which is called adaptive parallel enumeration (APE).
Ribeiro et al. implemented the presented methods on a 192-core distributed memory cluster with message passing using the OpenMPI library. Various biological networks, such as neural, gene, metabolic, and protein interaction networks, were tested. Comparison of sequential g-tries with the parallel version showed that the parallel version is scalable up to 128 processors. Tests of parallel ESU showed linear speedups for the gene and metabolic networks, but the neural and PPI network tests did not result in linear performance. The authors conclude that APE consumes a lot of time during the final aggregation of the results.
The same authors provided an extended and more generalized form of parallel motif discovery, again using the ESU algorithm [22] as in their previous work. Their earlier work involved parallelizing a single subgraph census only; this time, they parallelized all of the steps of the motif discovery process and introduced a distributed and dynamic receiver-initiated load balancing policy. They argue the existence of the following parallelization opportunities, similar to what we have discussed as general guidelines for parallelism.
A single call to the ExtSub procedure is called an ESU work unit (EWU), which is the call to ExtSub with an added network identifier. The main idea of the parallelization is to divide the EWUs among a number of processors. The parallel algorithm consists of three main steps:
1. Pre-processing phase: All initialization including the initial work queue for each
process is performed.
2. Work phase: Two algorithms are used to analyze the network and find frequencies
of subgraphs.
3. Aggregation phase: The subgraph frequencies found by processes are gathered
and their significance is computed.
In the work phase, two policies, called the master-worker strategy and the distributed strategy, are designed. In the former, the master performs load balancing, and the other processes are involved in the work phase; the master waits for requests from the workers and sends them any available work. In the distributed scheme, each worker has a work queue, and when the work units in this queue are finished, it asks other workers for work.
Tests were carried out on the 192-core system using the same software environment as in [26]. Food web, social, neural, metabolic, protein, and power networks were used to evaluate the designed method. The authors reported the speedups of their parallel ESU method using up to 128 cores. First, they showed that the hierarchical method performs better than the naive method in the aggregation phase. Second, they compared the master-worker and distributed control strategies for the parallelization of the whole motif discovery process. With both methods, they achieve nearly linear speedups up to 128 cores, but distributed control performs better, because all cores are involved in the computation without wasting additional power. Finally, they compared the speedups of distributed control using 128 cores when the number of random networks varies from 0 to 1000. The algorithm scales well, but the speedup decreases as the number of random networks increases, as expected.
We started this chapter by describing network motifs, which are the building blocks of biological networks. Motifs are assumed to have biological significance by carrying out special functions. Discovery of network motifs is a fundamental research area in bioinformatics as it provides insight into the functioning of the network. Network motifs may also be used to compare biological networks, since similar motifs found in two or more networks may mean that similar functions are performed in them by the same motifs. The network motif search problem consists of three steps: subgraph enumeration, graph isomorphism, and determining statistical significance. We need to enumerate all of the subgraphs in exact enumeration, or a part of the network in sampling-based enumeration. The discovered subgraphs are classified into isomorphism classes in the second step, which is a problem in NP. Finally, we need to assess the statistical significance of the found subgraphs by generating random graphs similar to the original graph and applying the same procedures to these random graphs.
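The significance computation in the last step is commonly expressed as a z-score of the observed frequency against the random ensemble; a minimal sketch (function and variable names are ours):

from statistics import mean, stdev

def motif_z_score(freq_real, freqs_random):
    """z-score of a subgraph's frequency in the real network against
    its frequencies in an ensemble of randomized networks."""
    mu, sigma = mean(freqs_random), stdev(freqs_random)
    return (freq_real - mu) / sigma if sigma > 0 else float('inf')

# A subgraph appearing 42 times in the real network but around 11 times
# in the randomized networks gets a large positive z-score:
print(motif_z_score(42, [10, 12, 11, 13, 9, 11]))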
We reviewed some of the sequential algorithms for motif discovery. An early study is the mfinder algorithm, which enumerates all subgraphs. It has limited use as it is biased, requires a large memory space, and may find the same motif more than once. The sampling version of mfinder selects edges with some probability to enlarge subgraphs [11]. The ESU algorithm presented by Wernicke handles some of the problems of mfinder; it is unbiased and discovers each motif only once [34]. Kavosh uses the levels of a tree to form subgraphs and achieves discovery of motifs of sizes 8 and 9 [10]. NeMoFINDER uses trees to find subgraphs and is designed for PPI networks [4]. The Motif Analysis and Visualization Tool (MAVisto) is a network-centric algorithm that uses a pattern growth method [29]. The Grochow–Kellis [8] and MODA [21] algorithms are motif-centric; MODA also has a sampling version. The Grochow–Kellis algorithm uses mapping with symmetry breaking to discover motifs, and MODA uses mapping with a pattern growth tree. Exact enumeration algorithms get slower as the size of the motif grows. FANMOD, which uses RAND-ESU, and Kavosh are more efficient than the others, and Kavosh can find motifs of larger sizes than FANMOD. FANMOD is faster than MODA; however, MODA can discover larger motifs [37].
We then described general approaches for parallel motif discovery in distributed memory architectures and reviewed a few sample algorithms reported in the literature. Wang's algorithm [32] partitions the target graph for parallel operation, and Schatz et al. provided a distributed version of the Grochow–Kellis algorithm by mapping the queries to a set of processors [27]. Ribeiro et al. provided three methods for parallel motif discovery: the first is based on a novel data structure called the g-trie [24,25], and the other two parallelize the ESU algorithm using the load balancing strategies described in this chapter.
1. Find the motif shown in Fig. 12.11a in the graph shown in (b) of the same figure
using F1 , F2 and F3 frequency concepts.
2. Show that the two graphs shown in Fig. 12.12 are isomorphic by finding a per-
mutation matrix P that transforms one to the other.
3. Find the 3-node network motifs in the graph of Fig. 12.13 using the mfinder algorithm.
4. Find the 3-node network motifs in the graph of Fig. 12.14 using the ESU algorithm.
5. Compare exact census algorithms with approximate census algorithms in terms
of their performances and the precision of the results obtained in both cases.
References
1. Albert I, Albert R (2004) Conserved network motifs allow protein-protein interaction predic-
tion. Bioinformatics 20(18):3346–3352
2. Artzy-Randrup Y, Fleishman SJ, Ben-Tal N, Stone L (2004) Comment on network motifs: sim-
ple building blocks of complex networks and superfamilies of designed and evolved networks.
Science 305(5687):1007
3. Battiti R, Mascia F (2007) An algorithm portfolio for the subgraph isomorphism problem.
In: Engineering stochastic local search algorithms. Designing, Implementing and analyzing
effective heuristics, Springer, Berlin Heidelberg, pp 106–120
4. Chen J, Hsu M, Lee L, Ng SK (2006) NeMofinder: genome-wide protein-protein interactions
with meso-scale network motifs. In: Proceedings of 12th ACM SIGKDD international confer-
ence knowledge discovery and data mining (KDD’06), pp 106–115
5. Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS,
Braun BR, Hopkins KL, Kondu P, Lengieza C, Lew-Smith JE, Tillberg M, Garrels JI (2001)
YPD(TM), PombePD(TM), and WormPD(TM): model organism volumes of the BioKnowledge(TM) Library, an integrated resource for protein information. Nucleic Acids Res 29:75–79
6. Erciyes K (2014) Complex networks: an algorithmic perspective. CRC Press, Taylor and Fran-
cis, pp 172. ISBN 978-1-4471-5172-2
7. Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-
completeness. W. H. Freeman
8. Grochow J, Kellis M (2007) Network motif discovery using subgraph enumeration and
symmetry-breaking. In: Proceedings of 11th annual international conference research in com-
putational molecular biology (RECOMB’07), pp 92–106
9. Han J-DJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM, Cusick ME, Roth FP, Vidal M (2004) Evidence for dynamically organized modularity in the
yeast protein-protein interaction network. Nature 430(6995):88–93
10. Kashani ZR, Ahrabian H, Elahi E, Nowzari-Dalini A, Ansari ES, Asadi S, Mohammadi S,
Schreiber F, Masoudi-Nejad A (2009) Kavosh: a new algorithm for finding network motifs.
BMC Bioinform 10(318)
11. Kashtan N, Itzkovitz S, Milo R, Alon U (2002) Mfinder tool guide. Technical report, Department of Molecular Cell Biology and Computer Science and Applied Mathematics, Weizmann Institute of Science
12. Kashtan N, Itzkovitz S, Milo R, Alon U (2004) Efficient sampling algorithm for estimating
sub-graph concentrations and detecting network motifs. Bioinformatics 20:1746–1758
13. Kreher D, Stinson D (1998) Combinatorial algorithms: generation, enumeration and search.
CRC Press
14. Lee TI et al (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science
298(5594):799–804
15. McKay BD (1981) Practical graph isomorphism. In: 10th Manitoba conference on numerical
mathematics and computing, Congressus Numerantium, vol 30, pp 45–87
16. McKay BD (1998) Isomorph-free exhaustive generation. J Algorithms 26:306–324
17. Mfinder. http://www.weizmann.ac.il/mcb/UriAlon/index.html
18. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs:
simple building blocks of complex networks. Science 298(5594):824–827
19. Milo R, Kashtan N, Levitt R, Alon U (2004) Response to comment on network motifs: simple
building blocks of complex networks and superfamilies of designed and evolved networks.
Science 305(5687):1007
20. Nauty User’s Guide, Computer Science Dept. Australian National University
21. Omidi S, Schreiber F, Masoudi-Nejad A (2009) MODA: an efficient algorithm for network
motif discovery in biological networks. Genes Genet Syst 84:385–395
22. Ribeiro P, Silva F, Lopes L (2012) Parallel discovery of network motifs. J Parallel Distrib
Comput 72(2):144–154
23. Ribeiro P. Efficient and scalable algorithms for network motifs discovery. Ph.D. thesis, Doctoral Programme in Computer Science, Faculty of Science, University of Porto
24. Ribeiro P, Silva F (2010) Efficient subgraph frequency estimation with g-tries. Algorithms
Bioinform 238–249
25. Ribeiro P, Silva F (2010) G-tries: an efficient data structure for discovering network motifs. In:
Proceedings of 2010 ACM symposium on applied computing, pp 1559–1566
26. Ribeiro P, Silva F, Lopes L (2010) A parallel algorithm for counting subgraphs in complex
networks. In: 3rd international conference on biomedical engineering systems and technologies,
Springer, pp 380–393
27. Schatz M, Cooper-Balis E, Bazinet A (2008) Parallel network motif finding. Technical report, University of Maryland Institute for Advanced Computer Studies
28. Schreiber F, Schwöbbermeyer H (2005) Frequency concepts and pattern detection for the analysis of motifs in networks. In: Transactions on computational systems biology, LNBI 3737. Springer, Berlin Heidelberg, pp 89–104
29. Schreiber F, Schwöbbermeyer H (2005) MAVisto: a tool for the exploration of network motifs. Bioinformatics 21:3572–3574
30. Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31(1):64–68
31. Ullmann JR (1976) An algorithm for subgraph isomorphism. J ACM 23(1):31–42
32. Wang T, Touchman JW, Zhang W, Suh EB, Xue G (2005) A parallel algorithm for extracting
transcription regulatory network motifs. In: Proceedings of the IEEE international symposium
on bioinformatics and bioengineering, IEEE Computer Society Press, Los Alamitos, CA, USA,
pp 193–200
33. Wernicke S (2005) A faster algorithm for detecting network motifs. In: Proceedings of 5th
WABI-05, vol 3692, Springer, pp 165–177
34. Wernicke S (2006) Efficient detection of network motifs. IEEE/ACM Trans Comput Biol
Bioinform 3(4):347–359
35. Wernicke S, Rasche F (2006) FANMOD: a tool for fast network motif detection. Bioinformatics
22(9):1152–1153
36. Williams RJ, Martinez ND (2000) Simple rules yield complex food webs. Nature 404:180–183
37. Wong E, Baur B, Quader S, Huang C-H (2011) Biological network motif detection: principles
and practice. Brief Bioinform. doi:10.1093/bib/bbr033
Network Alignment
13
13.1 Introduction
Given two graphs G_1(V_1, E_1) and G_2(V_2, E_2), they are isomorphic if there is a one-to-one and onto function f : V_1 → V_2 such that (u, v) ∈ E_1 ⇔ (f(u), f(v)) ∈ E_2. Given two graphs G_1(V_1, E_1) and G_2(V_2, E_2), with G_1 being a smaller graph than G_2, the subgraph isomorphism problem asks whether G_1 is isomorphic to a subgraph of G_2.
Fig. 13.1 An alignment graph of the PPI networks of 3 species. The proteins inside the clusters
have similar structures. Conserved interactions or pathways exist in all of the species
A matching in a graph G(V, E) is a subset of its edges that share no endpoints, and this concept can be used conveniently for the network alignment of two graphs. Global network alignment can be performed by mapping the vertices of two graphs G_1(V_1, E_1) and G_2(V_2, E_2), or more graphs, and is usually performed in two steps. In the first step, a similarity matrix R is constructed, whose entry r_ij shows the similarity score between vertices i and j, one from each graph. A complete weighted bipartite graph G_B(V_B, E_B, w), where V_B = V_1 ∪ V_2 and E_B = {(v_i, v_j) : ∀v_i ∈ V_1, v_j ∈ V_2}, is then constructed. We then search for a maximum weighted matching in G_B, as our aim is to find a mapping between V_1 and V_2 with the maximum total score. A weighted matching in a weighted bipartite graph is shown in Fig. 13.2. We can therefore use any maximum weighted bipartite matching algorithm to discover node pairs with high similarities, as we will see in Sect. 13.5.
The topological similarity between the graphs under consideration is based on the structure of the graphs. Node similarity, on the other hand, considers the similarity between the structures of the nodes of the graphs. In a PPI network, for example, the structure of a node is identified by the amino acid sequence of the protein constituting that node; however, node similarity may also take topological properties of nodes into consideration, such as their degrees or the similarity of their neighbors. Edge correctness is a parameter to evaluate topological similarity, defined as follows [38].
Definition 13.1 (Edge Correctness) Given two simple graphs G_1 and G_2, edge correctness (EC) is

$$EC(G_1, G_2, f) = \frac{|f(E_1) \cap E_2|}{|E_1|}, \qquad (13.1)$$

where f is the alignment function that maps G_1 to G_2. Edge correctness shows the percentage of correctly aligned edges and helps to evaluate the quality of the alignment achieved. Patro and Kingsford proposed the induced conserved structure (ICS) score metric, which extends the EC concept [30]. This parameter is specified as

$$ICS(G_1, G_2, f) = \frac{|f(E_1) \cap E_2|}{|E_{G_2[f(V_1)]}|}, \qquad (13.2)$$

where the denominator is the number of edges induced in the second graph G_2 by the mapped vertices. The goal when using the ICS score is to align a sparse region of G_1 to a similarly sparse region of G_2, or their dense regions to each other, which improves accuracy [9].
As another measure of the quality of the alignment, the gene ontology (GO) consistency of the aligned proteins is defined as the sum over all aligned pairs of

$$GOC(u, v) = \frac{|GO(u) \cap GO(v)|}{|GO(u) \cup GO(v)|}, \qquad (13.3)$$

for an aligned pair of nodes u ∈ V_1 and v ∈ V_2, where GO(u) denotes the GO terms associated with the protein u that are at a distance of 5 from the root of the GO hierarchy [1].
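All three quality measures are simple set computations for a given alignment; a minimal Python sketch under the definitions above (graphs are given as collections of node-pair edges, f as a dictionary, and the GO term sets are hypothetical inputs):

def ec(e1, e2, f):
    """Edge correctness, Eq. 13.1: fraction of G1 edges preserved by f."""
    mapped = {frozenset((f[u], f[v])) for u, v in e1}
    edges2 = {frozenset(e) for e in e2}
    return len(mapped & edges2) / len(e1)

def ics(e1, e2, f):
    """Induced conserved structure, Eq. 13.2: preserved edges over the
    edges G2 induces on the image f(V1)."""
    mapped = {frozenset((f[u], f[v])) for u, v in e1}
    edges2 = {frozenset(e) for e in e2}
    image = set(f.values())
    induced = {e for e in edges2 if e <= image}
    return len(mapped & edges2) / len(induced)

def goc(go_u, go_v):
    """GO consistency of one aligned pair, Eq. 13.3 (a Jaccard index)."""
    return len(go_u & go_v) / len(go_u | go_v)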
A different alignment metric is the assessment of the size of the largest connected
component (LCC) shared by the input graphs. A larger LCC shows greater similarity
among the input graphs. The quality of the alignment achieved can be assessed by
statistical methods and by comparing the alignment with the alignment of random
networks of the same sizes [32].
The sequence alignment methods we have seen in Chap. 7 can be conveniently used to find node similarities. The BLAST algorithm [2] is frequently used for this purpose, and Kuchaiev et al. proposed a scoring function between the nodes of the graphs that evaluates the number of small connected subgraphs, called graphlets, in which each node is included, and therefore uses topological properties of nodes [20,21]. Various algorithms consider both topological and node similarities when assessing similarities between networks; this is especially meaningful in PPI networks, where the function of a protein depends on both its position in the network and its amino acid sequence.
The methods for network alignment vary in their choice of the type of alignment as
well as in their approach to solve this problem. In general, they can have one of the
following attributes:
1. Pairwise or Multiple Alignment
2. Node or Topological similarity information
3. Local or Global Alignment
Two networks are aligned in pairwise alignment, while more than two networks are aligned in multiple alignment, which is computationally harder than the former. PathBLAST [16], MaWISh [19], GRAAL [20], H-GRAAL [26], and IsoRank [38] are examples of pairwise alignment algorithms. Extended PathBLAST [36], Extended IsoRank [39], and Graemlin [11] all provide multiple network alignment. Node (functional) similarity information uses information other than the network topology to evaluate the similarity between nodes, whereas only topological information is used to determine topological similarity. In practice, a combination of both, with different weighting schemes, is used to decide on the similarity scores between node pairs.
Local alignment methods attempt to map small subnetworks between two or more species. For example, the PPI networks of two species may have similar subnetworks even if they are not evaluated as similar by a global alignment method. The logic here is very similar to what we discussed when comparing local and global sequence alignment in Chap. 6. Example local alignment algorithms are NetAlign [24], MaWISh [19], PathBLAST [16], and NetworkBLAST [35]. Global network alignment (GNA) aims to find the best overall alignment of a given set of input graphs by detecting the maximum common subgraph between these networks. In this method, each node of an input graph representing a network is either matched to some node of another network or left unmatched, which is usually shown as a gap.
Fig. 13.3 An example GNA alignment between graphs G_1 and G_2 where a node x in G_1 is mapped to a node x′ in G_2. Edge (c, h) is deleted and edges (c′, b′) and (g′, h′) are added in G_2; gap node i is not mapped
The global sequence alignment aims to compare genomic sequences to discover variations between species; GNA is similarly used to understand the differences and similarities across species [38]. GRAAL [20], H-GRAAL [26], MI-GRAAL [21], C-GRAAL [27], IsoRank [38], and Graemlin [11] all provide global alignment. Figure 13.3 displays a GNA between two graphs G_1 and G_2. Aligning two graphs on node similarity only can be performed by the Hungarian algorithm [22] in O(n³) time.
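For alignment on node similarity only, an off-the-shelf assignment solver can be used directly; a sketch with SciPy's Hungarian-type solver, applied here to the same similarity matrix that appears in the worked distributed example of Sect. 13.4:

import numpy as np
from scipy.optimize import linear_sum_assignment

# R[i, j] = similarity between node i of G1 and node j of G2
R = np.array([[ 1,  8, 13,  2],
              [15, 14,  6,  7],
              [ 3, 12,  4, 16],
              [11, 10,  5,  9]])

rows, cols = linear_sum_assignment(R, maximize=True)   # optimal assignment
print(list(zip(rows, cols)), R[rows, cols].sum())      # total weight 54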
13.3 Review of Sequential Network Alignment Algorithms

In this section, we provide a brief review of existing algorithms and tools for network alignment, including recent ones.
13.3.1 PathBLAST
PathBLAST is a local network alignment and search tool that compares PPI networks across species to identify protein pathways and complexes that have been conserved by evolution [16]. It uses a heuristic algorithm with a scoring function related to the probability of the existence of a path. It searches for high-scoring alignments between pairs of PPI paths by combining protein sequence similarity information with the network topology to detect paths of high scores. The query path supplied by the user is searched against the target PPI network.
13.3.2 IsoRank
IsoRank algorithm is proposed by Singh et al. to find GNA between two PPI net-
works [38]. It uses both sequence similarity and local connectivity information for
alignment, and is based on the PageRank algorithm. It consists of two stages. It first
assigns a score between each node pairs, one from each network using eigenval-
ues, network, and sequence data; and builds the score matrix R. In the second step,
highly matching nodes from R are extracted to find the GNA. The general idea in
the construction of matrix R is that two nodes i and j are a good match if their
neighbors also match well. After computing R, the node matchings which provide a
maximum sum of scores is determined by using maximum bipartite graph matching.
The bipartite graph G B represents a network at each side and the weights of edges
are the entries of R. The maximum-weight matching of G B provides the matched
nodes and the remaining nodes are the gap nodes. The authors have tested IsoRank
for GNA of S. cerevisiae and D. melanogaster PPI networks and found a common
subgraph of 1420 edges.
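The first stage can be sketched as a power iteration on the constraint that R[i, u] is large when the neighbors of i and u also match well; the rendering below covers only the purely topological part (no sequence term, uniform neighbor weights) and is our illustration, not the authors' code:

import numpy as np

def isorank_scores(adj1, adj2, n_iter=50):
    """Power iteration for R[i, u]: two nodes match well if their
    neighbors match well. adj1/adj2 are lists of neighbor lists."""
    n, m = len(adj1), len(adj2)
    R = np.ones((n, m)) / (n * m)
    for _ in range(n_iter):
        new = np.zeros_like(R)
        for i in range(n):
            for u in range(m):
                for j in adj1[i]:
                    for v in adj2[u]:
                        new[i, u] += R[j, v] / (len(adj1[j]) * len(adj2[v]))
        R = new / new.sum()          # normalize to keep scores bounded
    return R

# Two triangles align perfectly; all scores are equal by symmetry.
tri = [[1, 2], [0, 2], [0, 1]]
print(isorank_scores(tri, tri))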
13.3.3 MaWISh
13.3.4 GRAAL
Graph Aligner (GRAAL) is a global alignment algorithm that uses topological similarity information only [20]. It uses graphlets, which are small, connected, non-isomorphic subgraphs of a given size. Given two graphs G_1(V_1, E_1) and G_2(V_2, E_2) representing two PPI networks, it produces a set of ordered pairs (u, v), with u ∈ V_1 and v ∈ V_2, by matching them using graphlet degree signature similarity. The graphlet degree signature is computed for each node of each graph by counting the number of graphlets of up to size 4 in which the node is included, and assigning a score that reflects this number. The scores of the nodes in the two graphs are then compared to find similarities. GRAAL first finds a single seed pair of nodes with high graphlet degree signature similarity and afterwards enlarges the alignment radially around the seed using a greedy algorithm.
H-GRAAL [26] is a version of GRAAL that uses the Hungarian algorithm. The MI-GRAAL (Matching Integrative GRAAL) algorithm makes use of graphlet degree signature similarity, local clustering coefficient differences, degree differences, eccentricity similarity, and node similarity based on BLAST [21]. The C-GRAAL (Common-neighbors-based GRAAL) algorithm [27] is based on the idea that the neighbors of mapped nodes in the two graphs to be aligned should themselves have mapped neighbors.
Natalie [10] is a tool for pairwise global network alignment that uses the Lagrangian relaxation method proposed by Klau [17]. GHOST is a spectral pairwise global network alignment method proposed by Patro and Kingsford [30]. It uses the spectral signature of a node, which is related to the normalized Laplacian of subgraphs of various radii centered around that node. SPINAL (scalable protein interaction network alignment), proposed by Aladag and Erten [1], first performs a coarse-grained alignment to obtain similarity scores; the second phase involves a fine-grained alignment using the scores obtained in the first phase.
13.4 Distributed Network Alignment
A sequential greedy algorithm for the MWBM problem can be designed which selects the heaviest edge (u, v) in the active edge set and includes it in the matching. It then removes all edges incident to u and v from the active edge set and continues until no edges are left, as shown in Algorithm 13.1.
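A minimal Python rendering of this greedy scheme follows (edge triples and the matrix R are set up to match the worked example below; this sketches the idea of Algorithm 13.1 rather than reproducing its pseudocode):

def greedy_matching(edges):
    """Greedy maximal weighted matching: repeatedly take the heaviest
    remaining edge whose endpoints are both unmatched. This yields a
    1/2-approximation of the maximum weighted matching."""
    matched, matching = set(), []
    for w, u, v in sorted(edges, reverse=True):     # heaviest first
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching

# The bipartite instance of the worked example below: rows a-d of R
# against columns p-s; the greedy matching has total weight 54.
R = {('a','p'): 1, ('a','q'): 8, ('a','r'): 13, ('a','s'): 2,
     ('b','p'): 15, ('b','q'): 14, ('b','r'): 6, ('b','s'): 7,
     ('c','p'): 3, ('c','q'): 12, ('c','r'): 4, ('c','s'): 16,
     ('d','p'): 11, ('d','q'): 10, ('d','r'): 5, ('d','s'): 9}
print(greedy_matching([(w, u, v) for (u, v), w in R.items()]))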
We will illustrate this concept by a simple example where our aim is to find GNA
between two graphs G 1 with nodes a, b, c, d and G 2 which has p, q, r, s and there
are two available processes p0 and p1 . Let us assume the similarity matrix R, with
rows as the G 1 partition and columns as G 2 , is formed initially after which we
partition the rows equally among the two processes as below:
$$R = \begin{bmatrix} 1 & 8 & 13 & 2 \\ 15 & 14 & 6 & 7 \\ 3 & 12 & 4 & 16 \\ 11 & 10 & 5 & 9 \end{bmatrix}$$

where the first two rows are assigned to p_0 and the last two rows to p_1.
Let us further assume p_0 is the supervisor and starts the matching process by sending the third and fourth rows of matrix R to p_1. The two processes both sort the edges incident to their vertices in decreasing order of weight. In this case, p_0 builds the list L_0 with elements {(b, p), 15}, {(b, q), 14}, {(a, r), 13}, {(a, q), 8}, {(b, s), 7}, {(b, r), 6}, {(a, s), 2}, {(a, p), 1}. Similarly, p_1 forms the list L_1 containing {(c, s), 16}, {(c, q), 12}, {(d, p), 11}, {(d, q), 10}, {(d, s), 9}, {(d, r), 5}, {(c, r), 4}, {(c, p), 3} and sends this list to p_0. The root p_0 then merges these two sorted lists into a new sorted list L_N. In the final step of the algorithm, p_0 includes edges in the matching from the front of L_N as long as they obey the matching rule, that is, neither endpoint of an edge is included in a previous matching. In this example, the edges included in the matching, in decreasing weight, are {(c, s), 16}, {(b, p), 15}, {(a, r), 13} and {(d, q), 10}, as shown in Fig. 13.5, giving a total weight of 54.
Analysis
Each process p_i receives n × n/k data items, and sorting this data requires O((n²/k) log(n²/k)) time. The supervisor p_0 receives the sorted lists, each of size n²/k, and needs to merge the k lists to find the globally sorted list. Merging two sorted lists of sizes l_1 and l_2 takes O(l_1 + l_2) time; thus, merging k lists of size n²/k each takes O(n²) time, and the maximal matching in the last step takes O(n) time. Therefore, the total parallel execution time is O((n²/k) log(n²/k) + n²). The sequential case requires O(m log m) time to sort the edges, which is O(n² log n²), since a full bipartite graph with n nodes in each partition has n² edges. The speedup, the ratio of the sequential time to the distributed one, is (k log n)/(log n + k/2), ignoring the log k term. This indicates a slow increase with increasing k: for a large full bipartite graph with 2^16 × 2^16 nodes, running the distributed algorithm with 4, 8, and 16 processors provides speedups of 3.6, 6.4, and 10.7, respectively, all without considering interprocess communication costs.
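The quoted figures follow directly from the speedup expression; a short derivation with base-2 logarithms (a sketch filling in the arithmetic, with the log k term dropped as in the text):

$$S = \frac{n^2 \log n^2}{\frac{n^2}{k}\log\frac{n^2}{k} + n^2}
  = \frac{2\log n}{\frac{1}{k}\,(2\log n - \log k) + 1}
  \approx \frac{2k\log n}{2\log n + k} = \frac{k\log n}{\log n + k/2}.$$

For $n = 2^{16}$ (so $\log n = 16$) and $k = 8$, this gives $S = (8 \cdot 16)/(16 + 4) = 6.4$, matching the figure above.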
Fig. 13.6 Running of Hoepman’s algorithm. a The original graph, b First phase of the algorithm
processing nodes a, b and c in sequence, c Output from the second phase of the algorithm
The strategy employed in a real-world auction can be used as the basis of MWBM algorithms. The idea in these auction-based algorithms is to assign buyers to the most valuable objects. In this model, given a weighted bipartite graph G(V_1 ∪ V_2, E, w), i ∈ V_1 is considered a buyer, j ∈ V_2 is an object, and the weight of the edge (i, j) between them is the cost of buying object j. Each buyer i bids for only one object at a time.
The basic auction algorithm was proposed by Bertsekas to solve the assignment problem [6]. It has polynomial time complexity and is also suitable for distributed processing. We will first describe the sequential auction algorithm and then provide its distributed version for maximum weighted bipartite graph matching. The bipartite graph G(V_1 ∪ V_2, E) we consider has n nodes in each partition, for a total of 2n nodes. The benefit of matching u ∈ V_1 to v ∈ V_2 is the weight w_{uv} of the edge (u, v). Our aim is to find a matching between the nodes of V_1 and V_2 such that the total weight $\sum_{u \in V_1, v \in V_2} w_{uv}$ over the matched pairs is maximized. The algorithm considers the nodes of V_1 as economic agents acting in their own interest and those of V_2 as objects that can be bought. Each object j has a price p_j, and the person receiving this object pays this price. Algorithm 13.4 shows the operation of the sequential auction algorithm as adapted from [34], which consists of three phases: initialization, bidding, and assignment.
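A compact sketch of the sequential algorithm for a dense n × n benefit matrix follows (the eps parameter and the data layout are our assumptions; with integer weights and eps < 1/n the resulting assignment is optimal):

def auction(w, eps=0.01):
    """Sequential auction algorithm sketch: a free buyer i bids for its
    most valuable object j, raising j's price by the bid increment
    (best value - second best value + eps); a previous owner of j
    becomes free again."""
    n = len(w)
    prices = [0.0] * n
    owner = [None] * n                    # owner[j] = buyer holding object j
    free = list(range(n))                 # unassigned buyers
    while free:
        i = free.pop()
        values = [w[i][j] - prices[j] for j in range(n)]
        j = max(range(n), key=lambda col: values[col])
        second = max(v for k, v in enumerate(values) if k != j)
        prices[j] += values[j] - second + eps     # bidding phase
        if owner[j] is not None:
            free.append(owner[j])                 # previous owner is outbid
        owner[j] = i                              # assignment phase
    return owner, prices

w = [[1, 8, 13, 2], [15, 14, 6, 7], [3, 12, 4, 16], [11, 10, 5, 9]]
owner, _ = auction(w)
print(sum(w[owner[j]][j] for j in range(4)))      # 54 for this instance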
There have been several studies aimed at finding parallel versions of the auction algorithm. Synchronous and asynchronous forms of the parallel algorithm were implemented on shared memory architectures in [5]. The algorithms presented in [7,33,34] all provide parallel auction algorithms that run on distributed memory architectures. We take a closer look at the algorithm of [34], in which all three stages of the auction algorithm are parallelized. The bipartite graph is first distributed among the k processes p_1, ..., p_k at the start of the algorithm. Each process p_i is responsible for a number of vertices, initializes its data structures, and computes the bids for the free buyers in its vertex set. All free buyers compute a bid in a single iteration, and then the local prices are exchanged. A buyer becomes the owner of an object if it has the highest global bid after the message exchanges. Sathe et al. implemented this parallel algorithm using MPI on a distributed memory architecture and reported the speedups obtained [34].
1. Find the maximum common subgraph between the two graphs shown in Fig. 13.7
by visual inspection.
2. The networks in Fig. 13.8 are to be aligned. Provide a possible alignment between these two networks and work out the edge correctness value.
3. Show the iteration steps of the greedy weighted bipartite matching algorithm on
the graph of Fig. 13.9. Find also the maximum weighted matching in this graph
and work out the approximation ratio of the greedy algorithm.
4. Show a possible execution of the distributed version of the greedy algorithm on the graph of Fig. 13.10.
5. Hoepman’s algorithm in sequential form is to be executed to find MWBM in
the graph of Fig. 13.10. Show the running of this algorithm and work out the
approximation ratio with respect to the maximal weighted matching for this graph.
6. Compare the sequential network alignment tools IsoRank, PathBLAST, GRAAL, and MaWISh in terms of their application domains and performances.
References
1. Aladag AE, Erten C (2013) SPINAL: scalable protein interaction network alignment. Bioin-
formatics 29(7):917–924
2. Altschul S, Gish W, Miller W, Myers E, Lipman D (1990) Basic local alignment search tool. J
Mol Biol 215(3):403–410
3. Avis D (1983) A survey of heuristics for the weighted matching problem. Networks 13(4):475–
493
4. Battiti R, Mascia F (2007) Engineering stochastic local search algorithms. designing, imple-
menting and analyzing effective heuristics, An algorithm portfolio for the subgraph isomor-
phism problem. Springer, Berlin, pp 106–120
5. Bertsekas DP, Castannon DA (1991) Parallel synchronous and asynchronous implementations
of the auction algorithm. Parallel Comput 17:707–732
6. Bertsekas DP (1992) Auction algorithms for network flow problems: a tutorial introduction.
Comput Optim Appl 1:7–66
7. Bus L, Tvrdik P (2009) Towards auction algorithms for large dense assignment problems.
Comput Optim Appl 43(3):411–436
8. Catalyurek UV, Dobrian F, Gebremedhin AH, Halappanavar M, Pothen A (2011) Distributed-
memory parallel algorithms for matching and coloring. In: 2011 international symposium
on parallel and distributed processing, workshops and Ph.D. forum (IPDPSW), workshop on
parallel computing and optimization (PCO11), IEEE Press, pp 1966–1975
9. Clark C, Kalita J (2014) A comparison of algorithms for the pairwise alignment of biological
networks. Bioinformatics 30(16):2351–2359
10. El-Kebir M, Heringa J, Klau GW (2011) Lagrangian relaxation applied to sparse global network
alignment. In: Proceedings of 6th IAPR international conference on pattern recognition in
bioinformatics (PRIB’11), Springer, pp 225-236
11. Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S (2006) Graemlin: general
and robust alignment of multiple large interaction networks. Genome Res 16:1169–1181
12. Fortin S (1996) The graph isomorphism problem. Technical Report TR 96-20, Department of
Computer Science, The University of Alberta
13. Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-
completeness. W.H. Freeman, New York
14. Hoepman JH (2004) Simple distributed weighted matchings. arXiv:cs/0410047v1
15. http://www.pathblast.org
16. Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T (2003) Conserved
pathways within bacteria and yeast as revealed by global protein network alignment. Proc
PNAS 100(20):11394–11399
17. Klau GW (2009) A new graph-based method for pairwise global network alignment. BMC
Bioinform 10(Suppl 1):S59
18. Kollias G, Mohammadi S, Grama A (2012) Network Similarity Decomposition (NSD): a fast
and scalable approach to network alignment. IEEE Trans Knowl Data Eng 24(12):2232–2243
19. Koyuturk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A (2006) Pairwise
alignment of protein interaction networks. J Comput Biol 13(2):182–199
20. Kuchaiev O, Milenkovic T, Memisevic V, Hayes W, Przulj N (2010) Topological network
alignment uncovers biological function and phylogeny. J Royal Soc Interface 7(50):1341–1354
21. Kuchaiev O, Przulj N (2011) Integrative network alignment reveals large regions of global
network similarity in yeast and human. Bioinformatics 27(10):1390–1396
22. Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logistic Q
2:83–97
23. Kuramochi M, Karypis G (2001) Frequent subgraph discovery. In: Proceedings of the 2001 IEEE international conference on data mining, IEEE Computer Society, pp 313–320
24. Liang Z, Xu M, Teng M, Niu L (2006) Comparison of protein interaction networks reveals
species conservation and divergence. BMC Bioinf. 7(1):457
25. Manne F, Bisseling RH (2008) A parallel approximation algorithm for the weighted maximum matching problem. In: Wyrzykowski R, Karczewski K, Dongarra J, Wasniewski J (eds) Proceedings of the seventh international conference on parallel processing and applied mathematics (PPAM 2007). Lecture notes in computer science, vol 4967. Springer, pp 708–717
26. Milenkovic T et al (2010) Optimal network alignment with graphlet degree vectors. Cancer
Inf. 9:121–137
27. Memisevic V, Przulj N (2012) C-GRAAL: common-neighbors-based global GRaph ALignment of biological networks. Integr Biol 4(7):734–743
28. Messmer BT, Bunke H (1996) Subgraph isomorphism detection in polynomial time on pre-
processed model graphs. Recent developments in computer vision. Springer, Berlin, pp 373–
382
29. Ogata H, Fujibuchi W, Goto S, Kanehisa M (2000) A heuristic graph comparison algorithm and
its application to detect functionally related enzyme clusters. Nucleic Acids Res 28:4021–4028
30. Patro R, Kingsford C (2012) Global network alignment using multiscale spectral signatures.
Bioinformatics 28(23):3105–3114
31. Preis R (1999) Linear time 2-approximation algorithm for maximum weighted matching in general graphs. In: Meinel C, Tison S (eds) Proceedings of the 16th annual symposium on theoretical aspects of computer science (STACS 99). Lecture notes in computer science, vol 1563. Springer, New York, pp 259–269
32. Przulj N (2005) Graph theory analysis of protein-protein interactions. In: Igor J, Dennis W
(eds) A chapter in knowledge discovery in proteomics. CRC Press
33. Riedy J (2010) Making static pivoting scalable and dependable. Ph.D. thesis, EECS Department, University of California, Berkeley
34. Sathe M, Schenk O, Burkhart H (2012) An auction-based weighted matching implementation
on massively parallel architectures. Parallel Comput 38(12):595–614
35. Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T
(2005) Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA
102:1974–1979
36. Sharan R et al (2005) Identification of protein complexes by comparative analysis of yeast and
bacterial protein interaction data. J Comput Biol 12:835–846
37. Sharan R, Ideker T (2006) Modeling cellular machinery through biological network compari-
son. Nat Biotechnol 24(4):427–433
38. Singh R, Xu J, Berger B (2007) Pairwise global alignment of protein interaction networks by
matching neighborhood topology. In: Research in computational molecular biology, Springer,
pp 16-31
39. Singh R, Xu J, Berger B (2008) Global alignment of multiple protein interaction networks with
application to functional orthology detection. PNAS 105(35):12763–12768
40. Ullmann JR (1976) An algorithm for subgraph isomorphism. J ACM 23(1):31–42
41. Yan X, Han J (2002) Gspan: graph-based substructure pattern mining. In: Proceedings of IEEE
international conference on data mining, pp 721–724
Phylogenetics
14
14.1 Introduction
One of the fundamental assumptions in biology is that all living organisms share common ancestors. Phylogeny is the study of these evolutionary relationships among organisms. Phylogenetics studies these associations through molecular sequence and morphological data, and aims to reconstruct the evolutionary dependencies among organisms. Earlier attempts by biologists to discover phylogenetic relationships between organisms used only morphological features. The data on living organisms considered in current research is a combination of DNA and protein amino acid sequences, and sometimes morphological characters. Finding these relations between organisms has many implications; for example, disease transmission patterns can be predicted using phylogenetics to understand the spread of contagious diseases [6]. Phylogenetics can also be employed in medicine to learn the origins of diseases and to analyze disease resistance mechanisms in other organisms in order to design therapies and cures for humans. For example, the evolution of viral pathogens such as flu viruses can be analyzed using phylogeny to design new vaccines.
Mutations are the driving forces behind evolution. A mutation event in an organism will be carried to its offspring. Many mutations are harmless and some result in improved traits, yet some are harmful and sometimes lethal to the organism.
Mutations accumulated over time may result in the generation of new species. The
evolutionary relatedness among the organisms may be depicted in a phylogenetic
tree structure which shows the ancestor/descendant relationships between the living
organisms under consideration in a tree topology. The leaves of this tree represent
the living organisms and the interior nodes are the hypothetical ancestors.
However, a phylogenetic tree may not be adequate to represent the evolutionary relatedness between all organisms. A horizontal gene transfer is the transfer of genetic material between different, and sometimes distantly related, organisms, and a phylogenetic tree does not represent such an event. A vertical gene transfer, on the other hand, involves the transfer of genetic material to descendants and is the basis of the tree representation of evolution.
14.2 Terminology
Fig. 14.1 a An unrooted phylogenetic tree. b The rooted version of the same tree
The maximum parsimony method searches for a tree that minimizes the number of changes to the data, and the maximum likelihood method searches all trees, selecting the most probable tree that explains the input data.
The input to a distance-based method is the distance matrix D[n, n] for n taxa, where d_ij is the distance between taxa i and j. Our aim is to construct a weighted-edge tree T where each leaf represents a single taxon and the distance between leaves i and j is approximately equal to the entry d_ij of D. We first define the important properties of phylogenetic trees:
Definition 14.1 (metric tree) A metric tree T(V, E, w) is a rooted or unrooted tree with an edge weight function w : E → R^+.

Metric trees have three main properties: all edges have nonnegative weights; d_ij = d_ji for any two nodes i, j ∈ V (symmetry); and for any three nodes i, j, k, d_ik ≤ d_ij + d_jk (the triangle inequality).
Definition 14.2 (additive tree) A tree is called additive if for all taxa pairs, the
distance between them is the sum of the edge weights of the path between them.
Figure 14.2 shows such an additive tree and its associated distance matrix.
Definition 14.3 (ultrametric tree) A rooted additive tree is called ultrametric if the
distance between any of its two leaves i and j and any of their common ancestors
k is equal (dik = d jk ). Therefore, for any two leaves i and j and the root r of an
ultrametric tree, d(i, r ) = d( j, r ).
The distances of such a tree are said to be ultrametric if, for any triplet of sequences
the distances are either all equal, or two are equal and the remaining one is smaller.
This condition holds for distances derived from a tree with molecular clock. In other
words, the sum of times down a path to the leaves from any node is the same, whatever
      A   B   C   D
  A   0   4   8   8
  B       0   8   8
  C           0   2
  D               0

Fig. 14.2 An additive tree and its associated distance matrix for taxa A, B, C, and D. Leaves A and B are attached to one internal node with branch lengths 2 each, leaves C and D to the other internal node with branch lengths 1 each, and the internal edge has length 5
the choice of path. This means that the divergence of sequences can be assumed to occur at the same constant rate at all points in the tree. The edge lengths in the resulting tree can therefore be viewed as times measured by a molecular clock with a constant rate. The ultrametric property implies additivity, but not vice versa. If the weights of the edges of a rooted metric tree represent times, all of the leaves of an ultrametric tree are equidistant from the root, as this distance represents the time elapsed since the MRCA, which is the same for all taxa. In short, the ultrametric tree property implies the existence of a molecular clock, that is, a constant rate of evolution.
Definition 14.4 (distance between two clusters) Given two disjoint clusters C_i and C_j, the distance d_ij between them is the sum of the distances between the sequences in the two clusters, normalized by the sizes of the clusters:

$$d_{ij} = \frac{\sum_{u \in C_i} \sum_{v \in C_j} d_{uv}}{|C_i|\,|C_j|} \qquad (14.1)$$

We are computing the average distance between the two clusters in this case. We may, however, use the single-link distance for simplicity, which is the shortest distance between the clusters. The following lemma provides an efficient way to calculate the distance to a cluster that is itself the union of two clusters.
Lemma 14.1 (efficiency) Let a cluster C_k be formed from two clusters C_i and C_j such that C_k = C_i ∪ C_j. The distance of C_k to another cluster C_l is then:

$$d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|} \qquad (14.2)$$
The proof is trivial, and this lemma helps us find the distance between two clusters in linear time. The UPGMA algorithm uses the two distance metrics defined above to find the distances from a newly formed cluster to all other clusters. The steps of UPGMA are as follows:
1. Assign each taxon to its own cluster and create a leaf node for it at height 0.
2. Find the two closest clusters C_i and C_j, merge them into a new cluster C_k = C_i ∪ C_j, and create an internal node for C_k at height d_ij/2, with the nodes of C_i and C_j as its children.
3. Compute the distances from C_k to all remaining clusters using Eq. 14.2 and delete the rows and columns of C_i and C_j from the distance matrix.
4. Repeat steps 2 and 3 until a single cluster remains.

As an example, consider four taxa A, B, C, and D with the successive distance matrices below. In the first round the closest pair, C and D with d_CD = 1, is merged into a cluster CD whose internal node X is placed at height 0.5; in the second round B joins CD at height 1.5, and finally A joins at height 3:

      A  B  C  D          A  B  CD          A  BCD
  A   0  6  6  6      A   0  6  6       A   0  6
  B      0  3  3      B      0  3       BCD    0
  C         0  1      CD        0
  D            0
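These steps translate directly into code; a compact sketch (cluster labels are built as nested strings purely for illustration), reproducing the example above:

def upgma(d, taxa):
    """UPGMA sketch: merge the closest clusters; new distances are the
    size-weighted averages of Eq. 14.2. d maps frozenset pairs to
    distances; returns the merge order with node heights (distance/2)."""
    size = {t: 1 for t in taxa}
    merges = []
    while len(size) > 1:
        a, b = min(((x, y) for x in size for y in size if x < y),
                   key=lambda p: d[frozenset(p)])
        dab = d[frozenset((a, b))]
        ab = '(' + a + ',' + b + ')'
        size[ab] = size[a] + size[b]
        for c in size:
            if c not in (a, b, ab):
                d[frozenset((ab, c))] = (d[frozenset((a, c))] * size[a] +
                                         d[frozenset((b, c))] * size[b]) / size[ab]
        del size[a], size[b]
        merges.append((ab, dab / 2))              # height of the new node
    return merges

taxa = ['A', 'B', 'C', 'D']
d = {frozenset(p): w for p, w in [(('A','B'), 6), (('A','C'), 6), (('A','D'), 6),
                                  (('B','C'), 3), (('B','D'), 3), (('C','D'), 1)]}
print(upgma(d, taxa))   # C,D join at 0.5; then B at 1.5; then A at 3.0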
The time taken to find the smallest element is O(n²), since the distance matrix has n² entries, and there are n − 1 iterations of the algorithm for n taxa; the total time is therefore O(n³). A disadvantage of UPGMA is that it assumes an ultrametric (and hence additive) tree.
       t1   t2   t3   t4   t5   t6
  t1    0   18    2   18    1   18   | p_0
  t2   18    0   18    4   18    4   |
  t3    2   18    0   18    2   18   | p_1
  t4   18    4   18    0   18    3   |
  t5    1   18    2   18    0   18   | p_2
  t6   18    4   18    3   18    0   |

Fig. 14.6 Distributed UPGMA, first round. The rows of the distance matrix are partitioned among processes p_0, p_1, and p_2. The minimum distance is between t_1 and t_5, which are clustered to form the two leaves of the phylogenetic tree
Fig. 14.7 The partial trees constructed by the distributed algorithm in each round. The trees in (a)–(e) correspond to rounds 1–5; the tree in (e) is final
In the example of Fig. 14.6, each process finds the minimum distance entry in its own rows, and the global minimum over these, which is between t_1 and t_5, is then determined. Therefore, the taxa t_1 and t_5 are grouped to form the leaves of the tree in the first step. This procedure continues for five rounds, at the end of which the final tree shown in Fig. 14.7 is constructed.
Fig. 14.8 The pairwise distances among four taxa i, j, k, and l on an unrooted tree; a denotes the length of the internal edge
The additive tree condition meant that for any two leaves, the distance between
them is the sum of edge weights of the path between them. We need a method to
check if a tree is additive or not by inspecting the distance matrix. We can now state
the four-point condition between four taxa.
Definition 14.5 (four-point condition) Given four taxa i, j, k, and l, the four-point condition holds if two of the possible sums d_il + d_jk, d_ik + d_jl, and d_ij + d_kl are equal and the third one is smaller than this sum.
As can be seen in Fig. 14.8, the possible distance sums between four taxa can be written as follows:

$$d_{il} + d_{jk} = T + 2a$$
$$d_{ik} + d_{jl} = T + 2a$$
$$d_{ij} + d_{kl} = T$$

where T is the sum of the distances of the leaves to their ancestors and a is the length of the internal edge. This means that the larger sum should appear twice among the three sums. A distance matrix D[n, n] is additive if and only if the four-point condition holds for every quadruple of its elements. We can check this condition for the distance matrix of Fig. 14.2. The three sums in this case are:

$$d_{AB} + d_{CD} = 4 + 2 = 6$$
$$d_{AC} + d_{BD} = 8 + 8 = 16$$
$$d_{AD} + d_{BC} = 8 + 8 = 16$$
Two of the sums are equal and the third one is smaller than this sum, so this matrix is additive.
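The four-point test is easy to run over a whole matrix; a minimal sketch (dictionary-of-dictionaries distance matrix, checked here on the matrix of Fig. 14.2):

from itertools import combinations

def is_additive(d, taxa):
    """Check the four-point condition for every quadruple: among the
    three pairings, the two largest sums must be equal."""
    for i, j, k, l in combinations(taxa, 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if sums[1] != sums[2]:
            return False
    return True

# The matrix of Fig. 14.2: sums 6, 16, 16 -> additive.
D = {'A': {'B': 4, 'C': 8, 'D': 8}, 'B': {'A': 4, 'C': 8, 'D': 8},
     'C': {'A': 8, 'B': 8, 'D': 2}, 'D': {'A': 8, 'B': 8, 'C': 2}}
print(is_additive(D, 'ABCD'))    # True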
The NJ algorithm assumes the additive tree property, but the tree does not have to be ultrametric. It has a similar structure to UPGMA: it takes as input the distance matrix D, whose entry d_ij shows the distance between clusters i and j, and merges the closest clusters at each iteration to form a new cluster. However, the distance computations differ from UPGMA: the general idea of this algorithm is to merge two clusters that are close to each other (minimizing the distance between them) but also far from the rest of the clusters (maximizing their distance to all other clusters); UPGMA did not aim for the latter. The method starts with a starlike tree as shown in Fig. 14.9 and iteratively forms clusters.
332 14 Phylogenetics
Y
E E
E
F F F
Fig. 14.9 Constructing a tree using NJ algorithm. a A star with six taxa (A, B, C, D, E, and F)
as leaves is formed. b The two leaves A and B that have the minimum of the distance between
them but are farthest to all other clusters are joined to form a new cluster and the distances to this
new cluster from all clusters are calculated. c The two closest to each other and farthest to all other
clusters are D and E and they are joined and the process is repeated
Let us now define the separation u_i of a cluster i from the rest of the clusters as:

$$u_i = \frac{1}{n-2} \sum_j d_{ij} \qquad (14.3)$$
where n is the number of clusters. The aim of the NJ algorithm is then to find the two clusters that have the minimum distance between them and also the largest distance to all other clusters. In other words, it searches for the cluster pair i and j with the minimum value of d_ij − u_i − u_j. These two clusters are replaced by a single node x in the tree, and the distances of all other clusters to x are calculated. The distance between the node x, representing the merged clusters i and j, and any other cluster k is:

$$d_{xk} = \frac{1}{2}(d_{ik} + d_{jk} - d_{ij}) \qquad (14.4)$$
This is exactly how the distance from the newly formed cluster (ij) to all other clusters is computed in the algorithm. We are now ready to review the NJ algorithm, which consists of the following steps:
1. Assign each taxon i to its own cluster initially.
2. Repeat:
   a. For each cluster i, compute its separation u_i = (1/(n−2)) Σ_j d_ij.
   b. Find the two clusters i, j for which d_ij − u_i − u_j is minimal.
   c. Define a new cluster (ij), represented by a node x. Calculate the branch lengths from i and j to x as

      d_ix = ½(d_ij + u_i − u_j),   d_jx = ½(d_ij + u_j − u_i)   (14.5)

   d. Compute the distance between the new cluster (ij) and each remaining cluster k using the additive property of Eq. 14.4 as

      d_(ij)k = ½(d_ik + d_jk − d_ij)   (14.6)

   e. Remove clusters i and j from the distance matrix and replace them by the single cluster (ij).
3. Until two clusters i and j remain.
4. Connect the last two clusters by a branch of length d_ij.
Finding the smallest distance takes O(n²) time in step 2b of the algorithm, and the main loop is executed n times, resulting in O(n³) time complexity. Let us illustrate this algorithm with an example: we are given a distance matrix D between four taxa A, B, C, and D, and the initial separation values u_i for each taxon are calculated using Eq. 14.3, as shown in Fig. 14.10.
We can now find the values d_ij − u_i − u_j for each pair of taxa using step 2b of the algorithm. There are two minimum values, and we arbitrarily select leaves B and C to form the first cluster, as shown. The branch lengths from B and C to their common ancestor X are calculated according to Eq. 14.5: d_BX = ½(10 + 19 − 25) = 2 and d_CX = ½(10 + 25 − 19) = 8. This is represented as asymmetric tree branches, as shown in Fig. 14.11.
We can now calculate the distances of all other taxa (A and D) to this new node X using Eq. 14.6: d_AX = ½(9 + 15 − 10) = 7 and d_DX = ½(19 + 25 − 10) = 17. We can then form the new distance matrix between taxa A, D, and the intermediate node X, and find the new separation values in the second iteration of the algorithm, as shown in Fig. 14.12.
The minimum values are now all equal, and we arbitrarily select A and D to cluster; these are merged into a new cluster, which actually terminates the iteration, as we now have two clusters (A, D and B, C). We continue with the evaluation of the distances of the two children A and D to the new ancestral node Y using Eq. 14.5: d_AY = ½(16 + 23 − 33) = 3 and d_DY = ½(16 + 33 − 23) = 13.
       B     C     D
  A  −30   −30   −34
  B        −34   −30
  C              −30

Fig. 14.11 NJ algorithm, first iteration: the values d_ij − u_i − u_j for each pair. The two minima (−34) are attained by the pairs (B, C) and (A, D); B and C are joined at node X with branch lengths d_BX = 2 and d_CX = 8
Selecting any other pair of clusters would result in the same phylogenetic tree of Fig. 14.13. The distance between the two intermediate nodes X and Y can be calculated using Eq. 14.6, giving d_XY = ½(7 + 17 − 16) = 4, and all of the subtrees we have built can be combined to give the final tree of Fig. 14.14, which reproduces the initial distance matrix exactly.
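Because the tree is additive, we can verify that it reproduces the distance matrix exactly by summing branch lengths along leaf-to-leaf paths; a short check using the branch lengths derived above (the distance matrix entries are the ones recovered in the computations, e.g., d_AB = 9):

import itertools

# Branch lengths from the worked example: B-X=2, C-X=8, A-Y=3, D-Y=13, X-Y=4.
edges = {('B','X'): 2, ('C','X'): 8, ('A','Y'): 3, ('D','Y'): 13, ('X','Y'): 4}
adj = {}
for (u, v), w in edges.items():
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

def path_len(u, v, seen=()):
    """Distance between two nodes in the tree by depth-first search."""
    if u == v:
        return 0
    for nxt, w in adj[u].items():
        if nxt not in seen:
            sub = path_len(nxt, v, seen + (u,))
            if sub is not None:
                return w + sub
    return None

for a, b in itertools.combinations('ABCD', 2):
    print(a, b, path_len(a, b))   # 9, 15, 16, 10, 19, 25: matches D exactly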
Distributed Neighbor Joining Algorithms
The operation of the UPGMA and NJ algorithms is similar, as they are both agglomerative hierarchical clustering algorithms; however, the calculation of the distances is different. Therefore, we can proceed as in the parallelization of UPGMA to obtain a parallel NJ algorithm to be executed on a distributed memory computer system. The idea is again to partition the n rows of the distance matrix among p processors, and the algorithm is similar in structure to the distributed UPGMA algorithm. Ideally, the speedup obtained approaches the number of processors.
Ruzgar and Erciyes combined the NJ algorithm with fuzzy clustering to construct phylogenetic trees of the Y-DNA haplogroup G of individuals [33].
The basic idea of the maximum parsimony method rests on the philosophical principle known as Ockham's razor: the best hypothesis to explain a complex process is the one that requires the fewest assumptions. Adapting this idea to constructing phylogenetic trees, we need to find the tree that requires the lowest number of mutations to explain the input taxa. We need to test a number of trees that have the taxa at their leaves and count the number of mutations in each of them; the one with the lowest number should best represent the real evolutionary process according to this method. There may be more than one solution with the same number of mutations. The parsimony problem can be inspected at two levels, the small parsimony problem and the large parsimony problem, which are described next.
Fitch's Algorithm
Fitch's algorithm is a dynamic programming algorithm to compute the minimum number of mutations in a given tree needed to explain the input taxa [13]. The candidate states of an intermediate node are determined by the states of its children. The algorithm starts from the leaves of the tree and, for each parent, assigns as label the intersection of the labels of its children if this intersection is not empty; otherwise, the label of the parent is the union of the labels of its children. The idea is to label a parent with the elements common to its children whenever such common elements exist; if the children share no elements, a mutation must have occurred on one of the child edges. The algorithm relabels the intermediate nodes starting from the root in the second phase. Formally, the algorithm consists of the following steps:
1. For each leaf v, S_v ← {X_v}, where X_v is the character state of v.
2. C ← 0; the parsimony score is initialized.
3. Starting from the leaves, traverse the tree upwards to the root. For any internal node u with children v and w:

$$S_u = \begin{cases} S_v \cap S_w & \text{if } S_v \cap S_w \neq \emptyset \\ S_v \cup S_w,\; C \leftarrow C + 1 & \text{otherwise} \end{cases}$$

4. We now need to finalize the state X_v of each node v. We start from the root, assigning it an arbitrary state from its set, and traverse the tree downward to the leaves. For each node u with parent v:

$$X_u = \begin{cases} X_v & \text{if } X_v \in S_u \\ \text{an arbitrary } y \in S_u & \text{otherwise} \end{cases}$$
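A compact sketch of the bottom-up phase for a single character follows (the topology below is one possible rendering of the tree of Fig. 14.15a; node names are ours):

def fitch(tree, leaf_states):
    """Fitch's small-parsimony algorithm for one character.
    tree: dict mapping each internal node to its (left, right) children.
    Returns (parsimony score, candidate state sets per node)."""
    sets, score = {}, 0

    def up(v):                        # bottom-up phase
        nonlocal score
        if v not in tree:             # leaf
            sets[v] = {leaf_states[v]}
            return
        l, r = tree[v]
        up(l); up(r)
        inter = sets[l] & sets[r]
        if inter:
            sets[v] = inter
        else:
            sets[v] = sets[l] | sets[r]
            score += 1                # a mutation is required here
        # (the top-down phase would then fix one state per node)

    root = next(iter(set(tree) - {c for l, r in tree.values() for c in (l, r)}))
    up(root)
    return score, sets

# Five taxa G, G, A, A, C as in Fig. 14.15a; this topology has score 2.
tree = {'x': ('t1', 't2'), 'y': ('t4', 't5'), 'z': ('t3', 'y'), 'root': ('x', 'z')}
leaves = {'t1': 'G', 't2': 'G', 't3': 'A', 't4': 'A', 't5': 'C'}
print(fitch(tree, leaves))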
Fig. 14.15 Fitch's algorithm applied to five taxa G, G, A, A, and C. The mutations are marked with a ×. a A possible tree with a Fitch score of 2, b another tree with the same score, c a tree with a score of 3, d the final labeling of the first tree
Processing one node in the first phase takes O(l²) time, where l is the number of different states of a character. For n input taxa of length m, the time required is O(nml²).
Fig. 14.16 a The MST of K_5, shown by bold lines, using the Hamming distances between the taxa, b the phylogenetic tree formed using the MST. The parsimony score is 5
Fig. 14.17 Enlarging an initial tree with three leaves A, B, and C. The new edge with taxon D is placed between all possible taxa pairs (A and B, A and C, and B and C), resulting in three more trees. Adding the new edge for taxon E results in 15 more trees, some of which are shown
Fig. 14.18 The nearest-neighbor interchange (NNI) rearrangements of the subtrees A, B, C, and D around an internal edge (u, v)
An obvious way to parallelize the evaluation of candidate trees is to assign a number of them to each process p_i and have each p_i work out the labeling of intermediate nodes using Fitch's or Sankoff's algorithm. However, this method will not suffice even for small input sizes, as the number of trees grows exponentially. We will therefore search for ways of parallelizing the schemes we have outlined for large parsimony, namely the 2-approximation algorithm, B&B, and the NNI methods.
Parallelizing the 2-Approximation Algorithm
Let us briefly review the steps of the 2-approximation algorithm and sketch possible ways of parallel processing on distributed memory processors in these steps. There is a central process p_0 among the k processes p_0, ..., p_{k−1}, as before.
1. We need to compare the characters of each taxa pair, which must be done O(n²) times. We can simply distribute the n taxa among the k processes, where each p_i is responsible for comparing the taxa in its subset T_i to all other taxa. Each p_i finds the Hamming distances D_i for the taxa in T_i and sends this partial distance matrix to the root process p_0.
2. The root p_0 gathers all partial results and forms the upper triangular matrix D. It can then construct the MST using any of the standard MST algorithms.
3. The root can now build the approximate parsimonious tree from the MST information, as in the sketch below.
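The sequential core of steps 1 and 2, Hamming distances plus an MST by Prim's algorithm, can be sketched as follows (the label-to-sequence assignment is one plausible reading of Fig. 14.16; the MST weight, 5, equals the parsimony score reported there):

from itertools import combinations

def hamming(s, t):
    """Number of differing characters between two equal-length sequences."""
    return sum(a != b for a, b in zip(s, t))

def prim_mst(taxa, seqs):
    """Prim's algorithm on the complete graph K_n with Hamming weights;
    returns the MST edges used to build the approximate parsimony tree."""
    in_tree, edges = {taxa[0]}, []
    while len(in_tree) < len(taxa):
        w, u, v = min((hamming(seqs[u], seqs[v]), u, v)
                      for u in in_tree for v in taxa if v not in in_tree)
        in_tree.add(v)
        edges.append((u, v, w))
    return edges

seqs = {'P': 'AGAT', 'Q': 'GGAT', 'R': 'GCAT', 'S': 'ATAT', 'T': 'CTAG'}
print(prim_mst(list(seqs), seqs))   # total edge weight 5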
A parallel branch-and-bound search for the most parsimonious tree was implemented. The method proposed was tested using data sets consisting of 24 sequences, each with a length of 500 base pairs, and was shown to be 1.2–7 times faster than the phylogenetic analysis using parsimony (PAUP) tool [16], which is used for inferring and interpreting phylogenetic trees. The NNI algorithm also has the potential to be executed in parallel: after exchanging subtrees, we would have a number of trees that can be evaluated for parsimony in parallel.
In one parallel approach, the root process grows candidate trees by adding nodes and evaluates the goodness of the trees as new edges representing new taxa are added. It then distributes only the good trees, evenly, to the worker processes. These processes add new taxa to the received trees, evaluate the ML scores of the newly formed trees, and send only the good trees back to the root. The root, having received good trees from all workers, determines a number of good trees among them and sends these to the processes to be evaluated in the next round. The parallel algorithm proposed in [5] follows a similar strategy, in which a reference tree is broadcast to the workers by the root process; the workers generate a number of trees, evaluate their ML scores, and return them to the root for further processing to find the best tree.
A number of tools have been developed for parallel ML analysis. A distributed ML algorithm is presented in [24]. TREE-PUZZLE is based on MPI [37] and uses the quartet puzzling method, and RAxML [40] is another parallel tool based on the ML method. MultiPhyl [25] and DPRml [24] are also based on ML and can run on heterogeneous platforms. Distributed and parallel algorithms for the inference of huge phylogenetic trees based on the ML method are described in [36].
As we have seen, phylogenetic trees are widely used to infer evolutionary relationships
between organisms; they represent speciation events and descent-with-modification
events. However, they do not represent reticulate events such as gene transfer,
hybridization, recombination, or reassortment [21]. A phylogenetic network can be
used to discover the evolutionary relatedness of organisms when reticulate events in
these organisms are significant [8,15,29]. Formally, a phylogenetic network is a directed
acyclic graph (DAG) G(V, E) with V = VL ∪ VT ∪ VN, where VL, VT, and VN are
defined as follows:
• VL : Set of leaves of G
• VT : The tree nodes where each node has in-degree of 1 and out-degree of 2
• VN : The network nodes, where each node has in-degree of 2 and out-degree of 1
Tree nodes represent mutations, and network nodes describe reticulation events;
the latter are sometimes called reticulate nodes, and any edge entering such a node
is a reticulate edge. Like phylogenetic trees, phylogenetic networks can be rooted or
unrooted. In a rooted phylogenetic network, a special node with an in-degree of 0
must be present. Figure 14.19 displays a rooted and an unrooted phylogenetic
network.
An unrooted phylogenetic network is basically an unrooted graph with leaves
labeled by the taxa [22]. The two main types of such networks are split networks
and quasi-median networks. A split is a partition of the taxa into two nonempty sets,
and a split network is a connected graph in which each split is represented by an
edge or a band of parallel edges.
Fig. 14.19 a An unrooted phylogenetic network, b A rooted phylogenetic network. All of the
intermediate nodes in both networks are network nodes
We will now take a closer look at median networks, as they are frequently used
to find evolutionary relationships of species. Median networks were developed to
help visualize mtDNA sequences [2]. A node in a median network represents
a sequence of a multiple sequence alignment, and two nodes are connected by an
edge if they differ by only one mutation [22]. The median-joining method proposed
by Bandelt et al. constructs phylogenetic networks conveniently [3]. In this method,
minimum spanning trees that contain only the observed taxa as nodes are first
constructed; in other words, these trees do not have hypothetical nodes as in an
ordinary phylogenetic tree or network. The distances between the taxa are the
Hamming distances between the sequences. These MSTs are then combined into a
minimum spanning network, which contains every edge that appears in at least one MST.
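A minimal sketch of this construction, written for this section and assuming aligned input sequences, is given below; it uses the standard characterization that an edge belongs to some MST exactly when its endpoints lie in different components of the graph formed by all strictly shorter edges.

```python
# Minimal minimum spanning network (union of all MSTs): process edge
# weights level by level with a union-find structure; at each level an
# edge is kept iff its endpoints are not yet connected by shorter edges.
from itertools import combinations, groupby

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def min_spanning_network(seqs):
    parent = list(range(len(seqs)))

    def find(x):                              # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((hamming(seqs[i], seqs[j]), i, j)
                   for i, j in combinations(range(len(seqs)), 2))
    network = []
    for _, level in groupby(edges, key=lambda e: e[0]):
        # test all edges of this weight before merging any of them
        accepted = [e for e in level if find(e[1]) != find(e[2])]
        network.extend(accepted)
        for _, i, j in accepted:
            parent[find(i)] = find(j)
    return network

print(min_spanning_network(["AAT", "AAA", "ATA", "TTA"]))
```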
Basically, the median-joining method constructs a subset of the full quasi-median
network using the minimum spanning network. The network constructed is not as
complex as a quasi-median network, but it is not as simple as a minimum spanning
network either. The nodes in the median-joining network are the taxa and the
hypothetical nodes. Relationships are evaluated only for taxa that are close to each
other in the minimum spanning network. Median-joining networks are commonly
used to study the relationships between haplotypes.
Phylogeny is the study of the evolutionary relationships between organisms. The aim
is to infer evolutionary relatedness using data of living organisms or, sometimes,
data obtained from extinct species. The input data is either DNA/RNA nucleotide
sequences or protein amino acid sequences, and sometimes both. We first need to find
the similarities of organisms by comparing their sequence data, which is frequently
accomplished by the multiple sequence alignment methods reviewed in Chap. 6.
We have analyzed two main algorithms for distance-based tree construction: the
UPGMA and the NJ algorithms. In both methods, the most similar taxa are grouped
together, and the distances of the remaining clusters to this new cluster are computed.
UPGMA assumes the ultrametric tree property, in which two taxa are equidistant
from any of their common ancestors; this in turn assumes the existence of a molecular
clock with a constant evolution rate. The NJ algorithm does not assume the ultrametric
property and uses the additive tree concept when computing the distances. The NJ
algorithm is more frequently used for constructing phylogenetic trees than the
UPGMA; however, neither of these algorithms depends on the site of the mutation,
resulting in decreased accuracy.
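A minimal UPGMA sketch, written for this summary, is shown below; it repeatedly merges the closest pair of clusters using size-weighted average linkage and can be used, for instance, to check the tree of Exercise 1. The input encoding is illustrative.

```python
# Minimal UPGMA: repeatedly merge the two closest clusters; distances to
# the merged cluster are size-weighted averages (average linkage).

def upgma(labels, dist):
    """labels: taxa names; dist[(i, j)], i < j: pairwise distances."""
    d = {frozenset(k): v for k, v in dist.items()}
    size = {i: 1 for i in range(len(labels))}
    tree = {i: labels[i] for i in range(len(labels))}
    nxt = len(labels)                          # id of the next new cluster
    while len(size) > 1:
        pair = min(d, key=d.get)               # closest pair of clusters
        i, j = tuple(pair)
        for k in size:
            if k not in pair:                  # size-weighted average
                dk = (size[i] * d.pop(frozenset({i, k})) +
                      size[j] * d.pop(frozenset({j, k}))) / (size[i] + size[j])
                d[frozenset({k, nxt})] = dk
        h = d.pop(pair) / 2                    # ultrametric merge height
        tree[nxt] = (tree.pop(i), tree.pop(j), h)
        size[nxt] = size.pop(i) + size.pop(j)
        nxt += 1
    return tree.popitem()[1]

labels = ["A", "B", "C", "D", "E"]             # matrix of Exercise 1
dist = {(0, 1): 4, (0, 2): 8, (0, 3): 8, (0, 4): 6, (1, 2): 8,
        (1, 3): 8, (1, 4): 5, (2, 3): 2, (2, 4): 5, (3, 4): 3}
print(upgma(labels, dist))                     # nested (left, right, height)
```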
Character-based tree construction methods in general have a higher time complexity
than distance-based methods. Parsimony-based methods search for the tree that
explains the input taxa with the minimum number of mutations. Maximum likelihood
and Bayesian trees are the probability-based methods of constructing evolutionary
trees. In the former, the likelihood of each tree is computed and the tree with the
maximum likelihood is selected. PHYLIP [11] and MEGA [26] are tools widely used
for such phylogenetic analyses.
Exercises
1. Given the following distance matrix for five taxa, work out the phylogenetic tree
using the UPGMA. Check the consistency of the constructed tree with the distance
matrix.
      A   B   C   D   E
  A   0   4   8   8   6
  B       0   8   8   5
  C           0   2   5
  D               0   3
  E                   0
2. The following distance matrix for four taxa is given. Construct the phylogenetic
tree using the NJ algorithm, showing the partial trees constructed at each
iteration. Check the additive tree property of the taxa first. Also describe the
distributed implementation of the NJ algorithm for this set of taxa using two processes.
      A   B   C   D
  A   0   4   8   8
  B       0   8   8
  C           0   2
  D               0
3. For the NJ algorithm example which resulted in the tree of Fig. 14.14, show that
choosing any other cluster pair in the last iteration of the algorithm will result in
the same phylogenetic tree output.
4. The five input taxa are A, G, G, T, and A. Use Fitch's algorithm to construct a
maximum parsimony tree for these taxa using the tree of Fig. 14.20. Work out the
total number of mutations for each labeling of the tree.
5. Find the labeling of the vertices of the phylogenetic tree of Fig. 14.21 using
Sankoff’s algorithm. Show the vectors at each node including taxa. The weights
for mutations are as shown in the table.
6. Discuss briefly the relation between the similarity of input taxa and the choice of
tree construction method.
      A   C   G   T
  A   0   3   1   2
  C       0   2   1
  G           0   3
  T               0
References
1. Addario-Berry L, Hallett MT, Lagergren J (2003) Towards identifying lateral gene transfer
events. In: Proceedings of 8th pacific symposium on biocomputing (PSB03), pp 279–290
2. Bandelt HJ, Forster P, Sykes BC, Richards MB (1995) Mitochondrial portraits of human populations using median networks. Genetics 141:743–753
3. Bandelt HJ, Forster P, Rohl A (1999) Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol 16(1):37–48
4. Bandelt HJ, Macaulay V, Richards M (2000) Median networks: speedy construction and greedy
reduction, one simulation, and two case studies from human mtDNA. Mol Phyl Evol 16:8–28
5. Blouin C, Butt D, Hickey G, Rau-Chaplin A (2005) Fast parallel maximum likelihood-based
protein phylogeny. In: Proceedings of 18th international conference on parallel and distributed
computing systems, ISCA, pp 281–287
6. Colijn C, Gardy J (2014) Phylogenetic tree shapes resolve disease transmission patterns. Evol
Med Public Health 2014:96–108
7. DasGupta B, He X, Jiang T, Li M, Tromp J, Zhang L (2000) On computing the nearest neighbor
interchange distance. Proceedings of DIMACS workshop on discrete problems with medical
applications 55:125–143
8. Doolittle WF (1999) Phylogenetic classification and the Universal Tree. Science 284:2124–
2128
9. Dunn JC (1974) Well separated clusters and optimal fuzzy partitions. J Cybern 4:95–104
10. Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach.
J Mol Evol 17(6):368–376
11. Felsenstein J (1991) PHYLIP: phylogenetic inference package. University of Washington,
Seattle
12. Felsenstein J (2004) Inferring Phylogenies. 2nd edn. Sinauer Associates Inc., Chapter 2
13. Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool 20:406–416
14. Gast M, Hauptmann M (2012) Efficient parallel computation of nearest neighbor interchange
distances. CoRR abs/1205.3402
15. Griffiths RC, Marjoram P (1997) An ancestral recombination graph. In: Donnelly P, Tavare
S (eds) Progress in population genetics and human evolution, volume 87 of IMA volumes of
mathematics and its applications. Springer, Berlin (Germany), pp 257–270
16. http://paup.csit.fsu.edu/
17. Hallett MT, Lagergren J (2001) Efficient algorithms for lateral gene transfer problems. In: Proceedings of the 5th annual international conference on computational molecular biology (RECOMB01). ACM Press, New York, pp 149–156
18. Hendy MD, Penny D (1982) Branch and bound algorithms to determine minimal evolutionary
trees. Math Biosci 60:133–142
19. Huber KT, Watson EE, Hendy MD (2001) An algorithm for constructing local regions in a
phylogenetic network. Mol Phyl Evol 19(1):1–8
20. Huson DH (1998) SplitsTree: a program for analyzing and visualizing evolutionary data. Bioinformatics 14(1):68–73
21. Huson DH, Scornavacca C (2011) A survey of combinatorial methods for phylogenetic networks. Genome Biol Evol 3:23–35
22. Huson DH, Rupp R, Scornavacca C (2010) Phylogenetic networks. Cambridge University Press
23. Jin G, Nakhleh L, Snir S, Tuller T (2007) Inferring phylogenetic networks by the maximum
parsimony criterion: a case study. Mol Biol Evol 24(1):324–337
24. Keane TM, Naughton TJ, Travers SA, McInerney JO, McCormack GP (2005) DPRml: distributed phylogeny reconstruction by maximum likelihood. Bioinformatics 21(7):969–974
25. Keane TM, Naughton TJ, McInerney JO (2007) MultiPhyl: a high-throughput phylogenomics webserver using distributed computing. Nucleic Acids Res 35(2):3337
26. Kumar S, Tamura K, Nei M (1993) MEGA: molecular evolutionary genetics analysis, ver. 1.01.
The Pennsylvania State University, University Park, PA
27. Lemey P, Salemi M, Vandamme A-M (eds) (2009) The phylogenetic handbook: a practical
approach to phylogenetic analysis and hypothesis testing, 2nd edn. Cambridge University
Press. ISBN-10: 0521730716. ISBN-13: 978-0521730716
28. Linder CR, Moret BME, Nakhleh L, Warnow T (2004) Network (reticulate) evolution: biology, models, and algorithms. In: The ninth Pacific symposium on biocomputing
29. Nakhleh L (2010) Evolutionary phylogenetic networks: models and issues. In: Heath L,
Ramakrishnan, N (eds) The problem solving handbook for computational biology and bioin-
formatics. Springer, pp 125–158
30. Nasibov EN, Ulutagay G (2008) FN-DBSCAN: a novel density-based clustering method with
fuzzy neighborhood relations. In: Proceedings of 8th international conference application of
fuzzy systems and soft computing (ICAFS-2008), pp 101–110
31. Robinson DF (1971) Comparison of labeled trees with valency three. J Comb Theory Ser B
11(2):105–119
32. Ropelewski AJ, Nicholas HB, Mendez RR (2010) MPI-PHYLIP: parallelizing computationally
intensive phylogenetic analysis routines for the analysis of large protein families. PLoS ONE
5(11):e13999. doi:10.1371/journal.pone.0013999
33. Ruzgar R, Erciyes K (2012) Clustering based distributed phylogenetic tree construction. Expert
Syst Appl 39(1):89–98
34. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425
35. Sankoff D (1975) Minimal mutation trees of sequences. SIAM J Appl Math 28:35–42
36. Stamatakis A (2004) Distributed and parallel algorithms and systems for inference of huge phylogenetic trees based on the maximum likelihood method. Ph.D. thesis, Technische Universität München, Germany
37. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18(3):502–504
38. Studier J, Keppler K (1988) A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5(6):729–731
39. Sung W-K (2009) Algorithms in bioinformatics: a practical introduction. CRC Press (Taylor
and Francis Group), Chap 8
40. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol
24(8):1586–1591
41. Zhou BB, Till M, Zomaya A (2004) Parallel implementation of maximum likelihood methods for phylogenetic analysis. In: Proceedings of the 18th international symposium on parallel and distributed processing (IPDPS 2004)
15 Epilogue
15.1 Introduction
So far, we have described, analyzed, and discussed biological sequence and network
problems from an algorithmic perspective. Our aim in this epilogue chapter is first
to observe the bioinformatics problems from a slightly more distant and philosophical
level and then to review the specific challenges in more detail. We start by coarsely
classifying, in our view, the current bioinformatics challenges. Management and
analysis of big data is one such important task and is confronted in all areas of
bioinformatics. Understanding diseases in search of cures is probably the grand
challenge of all times, as we discuss. Finally, bioinformatics education, its current
status, needs, and prospects are stated as other current challenges, as there seem to
be different views on this topic.
We then present specific challenges in technical terms, which is in fact a dense
overview of the book showing possible research areas to the potential researcher in
the field, namely the distributed algorithm designer and implementer in bioinformatics.
We conclude by describing the possible future directions we envisage. Big data will
get bigger, and efficient methods for handling it will be needed more than before;
understanding the mechanisms of diseases and searching for cures will always be a
central theme for bioinformatics researchers in collaboration with other disciplines.
Personalized medicine is the tailoring of care to meet individual needs, and it will
probably receive more attention in the near future than it does now, as it is needed
for better treatment and also for economic reasons. Population genetics is already a
popular topic of interest for individuals, and understanding population genetic traits
will help design better therapies in the future. Finally, understanding the mechanics
and operation of the cell as a whole system, rather than through its individual parts
such as DNA and proteins, will probably help to solve many bioinformatics problems,
including disease analysis.
Big data is a collection of large and complex data that cannot be processed efficiently
using conventional database techniques. High-throughput techniques such as
next-generation sequencing have provided big data in bioinformatics in the last decade.
This data basically comes as sequence, network, and image data. The basic requirement
in bioinformatics is to produce knowledge from the raw data. Unfortunately, our
ability to process this data does not grow in proportion to its production. We can
classify the management of big data as follows [17]:
• Access and management: The basic requirements for access and management of
big data are its reliable storage, a file system, and efficient and reliable network
access. Client/server systems are typically employed to access data which is dis-
tributed over a network. The file systems can be distributed, clustered, or parallel
[21].
• Middleware: Middleware resides between the application and the operating
system; it manages the hardware, the file system, and the processes. It is impossible
to design middleware that is suitable for all applications; rather, some common
functionality required by the applications can be determined, and the middleware
can then be designed to meet these common requirements. The Message Passing
Interface (MPI) can be considered one such middleware for applications that require
parallel/distributed processing.
• Data mining: Data mining is the process of analyzing large datasets with the aim of
discovering relationships among data elements and presenting the user with a
convenient method to analyze the data further. In practical terms, a data mining
method finds patterns or models in data, such as clusters and tree structures. Clusters
provide valuable information about raw data, as we have analyzed in Chaps. 7 and 11.
A data mining method should specify the evaluation method and the algorithmic
process, for example, the quality of clustering in a clustering method.
• Parallel and distributed computing: Given the huge size of the data, parallel/distributed
computing is increasingly required. At the hardware level, one can employ clusters
of tightly coupled processors, graphical processing units (GPUs), or distributed-memory
processors connected by an interconnection network. Our approach is to use the
latter, as it is the most versatile and can be implemented by many users conveniently.
We need distributed algorithms to exploit parallel operations on this huge data, and
this area has not received significant attention from the researchers of bioinformatics,
as we have tried to emphasize throughout this book. A toy MPI sketch follows this list.
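As a toy illustration of MPI in this role, the sketch below (assuming the mpi4py package) scatters chunks of data from the root to all processes and gathers the partial results back; it would be launched with something like mpiexec -n 4 python script.py.

```python
# Toy MPI sketch: the root scatters data chunks, every process computes
# a partial result on its chunk, and the root gathers and combines them.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    data = list(range(1000))
    chunks = [data[i::size] for i in range(size)]  # one chunk per process
else:
    chunks = None

my_chunk = comm.scatter(chunks, root=0)            # distribute the work
partial = sum(x * x for x in my_chunk)             # local computation
totals = comm.gather(partial, root=0)              # collect partial sums

if rank == 0:
    print("sum of squares:", sum(totals))
```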
diseases. A biological network can be represented by a graph and the rich theory
and algorithmic techniques for graphs can be used to analyze these networks. Two
important networks in the cell that are affected by the disease states of an organism
are the protein–protein interaction networks and the gene regulation networks.
Proteins are the fundamental molecules of the cell, carrying out vital functions
needed for life. They interact with each other, forming the protein–protein interaction
(PPI) networks we reviewed in Part II, and they also interact with DNA and RNA
to perform the necessary cell processes. In a graph representing a PPI network, nodes
represent the proteins, and undirected edges between nodes show the interactions
between them. Protein interactions play a key role in sustaining the healthy state of
an organism, from which we can deduce that their dysfunction may be one of the
sources of disease states.
A mutation in a gene may cause unwanted new interactions in PPI networks, such as
interactions with pathogens or with misfolded proteins, resulting in disease. For
example, protein misfolding due to mutations may result in lost interactions in a PPI
network, causing diseases [6]. Unwanted newly formed protein interactions due to
mutations are considered the main causes of certain diseases such as Huntington's
disease, Alzheimer's disease, and cystic fibrosis. Moreover, the agents of some
bacterial and viral infections, such as human papillomavirus, may interact with the
proteins of the host organism. In order to understand the mechanisms of disease in
the cell, PPI networks can be used to discover pathways, which are sequences of
biochemical reactions. Finding the subnetwork of a PPI network that corresponds to
a pathway can aid in understanding disease progression [6]. Also, by discovering
disease proteins, we can predict the disease genes.
A functional module in the cell has various interacting components and can be
viewed as an entity with a specific function. Gene expression is controlled by proteins
via regulatory interactions; for example, transcription factor proteins bind to sites near
genes in DNA to regulate them. The networks so formed, which contain genes, proteins,
other molecules, and their interactions, are called gene regulation networks (GRNs)
and can be represented by directed graphs. An edge in a GRN has an orientation,
for example, from the transcription factor to the regulated gene. A GRN is basically
a functional network, as opposed to the PPI network, which is physical. An important
challenge in a functional network is the identification of the subnetwork associated
with a disease. An effective approach to discovering functional modules in the cell is
the clustering process we have seen.
Investigation of the physical and functional networks is needed to identify the
disease-associated subnetworks. For example, discovery of a disease-affected pathway
can be performed by first identifying the mutated genes, then finding the PPI
subnetwork associated with the mutated genes, and finally searching for modules
associated with the disease PPI subnetwork to discover the dysfunctional pathways
[2]. In conclusion, the study of PPI networks and of functional networks such as
GRNs using graph-theoretical analysis is needed to better understand the disease
states of organisms.
of exposure to the preliminary topics described above. A possible remedy for this
situation is to have non-computer-science and non-bioinformatics students enroll in
a preparatory year to acquire basic computation and molecular biology knowledge
and skills. Dissertation topics vary, but we can see that a good proportion of them
concern the design and implementation of algorithms for bioinformatics problems.
There is an increasing interest in the analysis of biological networks; occasionally
this research is termed complex networks research, with biological networks given
as one case of such networks. Complex networks have large sizes and commonly
have the small-world property and a power-law degree distribution. Biological
networks are a class of complex networks, and the findings of complex networks
research are in many cases immediately applicable to networks in the cell.
The appropriate formation of curricula at the undergraduate and postgraduate
levels, especially from the algorithmic/tool point of view, needs to be considered
carefully. We will not attempt to form a curriculum here, as this has already been
done by the International Society for Computational Biology (ISCB). The education
committee of the ISCB formed a Curriculum Task Force in 2011, which provided two
reports on the possible curriculum structure [22,23].
We have so far reviewed the basic concepts in bioinformatics from two perspectives:
the sequence and network domains. We will review the fundamental problems
here to emphasize the main points, and also to guide the beginning researcher in
the field, namely in distributed algorithms for bioinformatics, toward potential
research topics.
once these are determined, we can employ any of the well-known data clustering
algorithms, such as hierarchical clustering or the k-means algorithm. We saw that
there are only a few distributed algorithms for the clustering of sequences, including
the parallel k-means algorithm, and we proposed two algorithms for this purpose.
Distributed sequence clustering is a potential research area, as there are only a few
reported studies.
Sequence Repeats: DNA/RNA and proteins contain repeats of subsequences.
These repeats may be tandem or distributed, and tandem repeats are frequently
found near genes. The repeating patterns may also indicate certain diseases, and
therefore finding them is an important area of research in bioinformatics. The
methods for this task can be broadly classified as probabilistic and combinatorial.
Probabilistic approaches for distributed repeats (sequence motifs) include MEME and
the Gibbs sampling method. Graph-based algorithms are commonly used for
combinatorial sequence motif discovery. Parallel/distributed algorithms are very
scarce for the discovery/prediction of sequence repeats in DNA/RNA or proteins,
and this is again a potential area of research. We proposed two distributed algorithms
for this purpose.
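As a baseline for the combinatorial methods, the naive sketch below reports a tandem repeat whenever a block is immediately followed by a copy of itself; efficient tools replace this brute-force scan with suffix-tree algorithms such as Stoye–Gusfield.

```python
# Naive tandem repeat scan: report (start, period) whenever the block
# s[i:i+p] is immediately repeated. This only illustrates the
# definition; suffix-tree methods do the same job far more efficiently.
def tandem_repeats(s, min_period=2):
    hits = []
    n = len(s)
    for p in range(min_period, n // 2 + 1):       # candidate period
        for i in range(n - 2 * p + 1):
            if s[i:i + p] == s[i + p:i + 2 * p]:  # block then its copy
                hits.append((i, p))
    return hits

print(tandem_repeats("ACGTACGTTT"))               # [(0, 4)]
```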
Gene Finding: Gene finding or gene prediction is the process of locating genes
in prokaryotes and eukaryotes. Finding genes in eukaryotes is more difficult due
to the existence of intron regions between the exons in the genes. Hidden Markov
models, artificial neural networks, and genetic algorithms are often used as statistical
approaches for gene finding. We have not been able to find any parallel/distributed
gene finding method in the literature. At first look, partitioning the genome among a
number of processors for gene search can be implemented with ease to speed up
processing.
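A first step of such a scheme would be partitioning the genome into overlapping chunks so that no gene spanning a chunk boundary is missed; a minimal sketch, assuming a known upper bound on gene length, follows.

```python
# Split a genome into k chunks that overlap by max_gene_len bases, so a
# gene crossing a chunk boundary is fully contained in some chunk.
def partition_genome(genome, k, max_gene_len):
    n = len(genome)
    step = -(-n // k)                       # ceiling division: chunk size
    return [(i * step, genome[i * step:min(n, (i + 1) * step + max_gene_len)])
            for i in range(k)]

# Each chunk would be shipped to one process running a sequential gene
# finder; hits are reported after adding the chunk's start offset back.
for start, chunk in partition_genome("ATG" * 2000, 4, max_gene_len=300):
    print(start, len(chunk))
```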
Genome Rearrangement: The order of subsequences in the genomes of organisms
changes during evolution. This modification of the genome structure may start new
organisms and may be the cause of some complex diseases. Therefore, it is of interest
to analyze these rearrangements. A common form of these large-scale mutations is
the reversal. We reviewed fundamental algorithms for reversals. Again, we have not
been able to find any parallel/distributed implementation of genome rearrangement
algorithms reported in the literature, necessitating further research in this area.
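For intuition about reversals, the sketch below sorts a permutation with the simple greedy scheme that brings the next smallest element into place with one reversal per step; it uses at most n − 1 reversals but, unlike the approximation algorithms reviewed earlier, it is not optimal.

```python
# "Simple reversal sort": at step i, reverse the segment between
# position i and the current position of the i-th smallest element.
# At most n - 1 reversals are used; the result is not optimal.
def simple_reversal_sort(perm):
    perm = list(perm)
    reversals = []
    for i in range(len(perm) - 1):
        j = perm.index(min(perm[i:]))        # where the next element sits
        if j != i:
            perm[i:j + 1] = reversed(perm[i:j + 1])
            reversals.append((i, j))
    return perm, reversals

print(simple_reversal_sort([3, 4, 1, 2, 5]))
# ([1, 2, 3, 4, 5], [(0, 2), (1, 3)])
```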
Haplotype Inference: Haplotype inference or genotype phasing refers to the
process of determining data about a single chromosome from the genotype data
of two chromosomes. This is needed to analyze diseases and also to find the
evolutionary distances between organisms. Three main methods for this task are
Clark's algorithm, expectation maximization, and perfect phylogeny haplotyping. We
have not been able to find a distributed haplotype inference algorithm in the literature,
and we proposed two distributed algorithms.
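The greedy idea behind Clark's algorithm can be sketched as follows; the encoding (0/1 for homozygous sites, 2 for heterozygous) and all names are illustrative. Unambiguous genotypes seed a set of known haplotypes, and a genotype is resolved whenever some known haplotype is compatible with it, which adds the inferred complement to the set.

```python
# Sketch of Clark's greedy phasing. Genotype sites: 0/1 homozygous,
# 2 heterozygous. A haplotype h explains genotype g together with its
# complement iff h matches g at every homozygous site.
def complement(g, h):
    """Second haplotype implied by g and h, or None if incompatible."""
    out = []
    for gs, hs in zip(g, h):
        if gs == 2:
            out.append(1 - hs)           # heterozygous: opposite allele
        elif gs != hs:
            return None                  # homozygous mismatch
        else:
            out.append(hs)
    return tuple(out)

def clark(genotypes):
    known, phased = set(), {}
    for g in genotypes:                  # seed: <= 1 heterozygous site
        if g.count(2) <= 1:
            h = tuple(1 if s == 2 else s for s in g)
            known |= {h, complement(g, h)}
    changed = True
    while changed:                       # greedy resolution rounds
        changed = False
        for g in genotypes:
            if g in phased:
                continue
            for h in list(known):
                c = complement(g, h)
                if c is not None:        # resolve g as the pair (h, c)
                    phased[g] = (h, c)
                    known.add(c)
                    changed = True
                    break
    return phased

print(clark([(0, 2, 1), (2, 2, 1), (0, 0, 2)]))
```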
network. Clustering, that is, finding dense regions of such networks, provides insight
into their operation, as these regions may indicate high activity and sometimes disease
areas. We looked at various general-purpose graph clustering algorithms and also
analyzed algorithms targeting biological networks, which have very large sizes.
We proposed two distributed algorithms: a modularity-based algorithm and another
one using Markov chains for the clustering of biological networks.
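For reference, the modularity of a partition of an undirected graph is Q = (1/2m) Σ_ij [A_ij − k_i k_j /(2m)] δ(c_i, c_j), where m is the number of edges and k_i the degree of node i; the sketch below computes it from an edge list, with a made-up toy graph.

```python
# Modularity Q of a node partition of an undirected, unweighted graph:
# Q = sum over communities c of [ e_c/m - (k_c/(2m))^2 ], where e_c is
# the number of intra-community edges and k_c the total degree in c.
from collections import Counter

def modularity(edges, community):
    m = len(edges)
    deg, intra = Counter(), 0
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if community[u] == community[v]:
            intra += 1
    ktot = Counter()                     # total degree per community
    for node, k in deg.items():
        ktot[community[node]] += k
    return intra / m - sum((k / (2 * m)) ** 2 for k in ktot.values())

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, community), 3))    # 0.357
```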
Network Motifs: A network motif in a biological network is a statistically
over-represented subgraph of such a network. These subgraph patterns often have some
associated functionality. Finding network motifs helps to discover their functionality
and also the conserved regions in a group of organisms. Since the network size is
very large, sampling algorithms, which search for a subgraph pattern in a small
representative part of the target graph, are frequently used. A motif m is searched for
in a set R of randomly generated graphs with properties similar to the target graph G,
and the occurrence of the motif in G should be statistically much higher than in the
elements of R. Several sequential algorithms for this problem have been proposed in
various studies, as we have outlined, but distributed algorithms are only few. Therefore,
this is another potential area of research in distributed bioinformatics.
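The significance test can be made concrete with a z-score: if f_G is the count of the candidate subgraph in G and f_R1, ..., f_RN are its counts in the randomized graphs, then z = (f_G − mean(f_R)) / std(f_R), and a large z marks an over-represented motif. The sketch below assumes the counts have already been produced by one of the census algorithms; the numbers are made up.

```python
# Z-score significance test for a candidate motif: compare its count in
# the target graph with its counts in an ensemble of randomized graphs.
from statistics import mean, stdev

def motif_zscore(count_in_G, counts_in_R):
    mu, sigma = mean(counts_in_R), stdev(counts_in_R)
    return (count_in_G - mu) / sigma if sigma > 0 else float("inf")

# counts as produced by a subgraph census (illustrative values)
z = motif_zscore(412, [105, 98, 120, 111, 97, 104, 116, 99])
print(round(z, 2))        # a large z flags an over-represented motif
```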
Network Alignment: The alignment of networks has a similar purpose to the
alignment of sequences; we aim to discover the similarities between graphs or
subgraphs of two or more given biological networks. We can then infer phylogenetic
relationships among them, and conserved regions can be identified to search for their
functionality. In this case, in contrast to network motif discovery, we are trying to
discover similar subnetworks rather than over-represented regions. Alignment can
be performed locally or globally, and pairwise or multiple, as in the sequence
alignment problem. Network alignment is closely related to the bipartite graph
matching problem, which can be solved in polynomial time. Auction algorithms are
used for bipartite matching, and distributed versions of these algorithms have been
reported, as we have reviewed. These are the only few algorithms for this purpose,
and we also proposed a distributed approximation algorithm for network alignment.
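As a small illustration of polynomial-time bipartite matching, the sketch below uses SciPy's Hungarian-method routine to pick a maximum-similarity one-to-one mapping between the nodes of two tiny networks; the similarity scores are made up for the example.

```python
# Maximum-weight bipartite matching of the nodes of network G1 to the
# nodes of network G2, given a (made-up) node similarity matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

sim = np.array([[0.9, 0.1, 0.3],      # sim[i, j]: similarity of node i
                [0.2, 0.8, 0.4],      # of G1 to node j of G2
                [0.3, 0.2, 0.7]])

rows, cols = linear_sum_assignment(sim, maximize=True)
for i, j in zip(rows, cols):
    print(f"G1 node {i} -> G2 node {j} (score {sim[i, j]})")
print("total alignment score:", sim[rows, cols].sum())
```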
Phylogeny: Phylogeny is the study of evolutionary relationships among organisms.
Using phylogeny, we can find disease spreading patterns and discover the origins of
diseases to help design therapies. Phylogenetic trees depict visually the evolutionary
relationships between organisms. Construction of these trees from a given input set of
sequences is a fundamental problem, and sequence alignment is frequently employed
to find the similarities of the sequences as the first step of building a phylogenetic tree.
We reviewed fundamental sequential methods of phylogenetic tree construction and
outlined a few methods of distributed construction. Phylogenetic networks are relatively
more recent structures that explain the evolutionary process in a more general sense
by considering events such as gene transfer and recombination. We have not been
able to find any reported method for distributed phylogenetic network construction
and therefore conclude that this is a potential research area.
Figure 15.1 depicts the hierarchical relationship between these areas of study
in bioinformatics. For example, sequence comparison, mainly by alignment, is the
front processing of sequential data analysis along with genome rearrangements, and
repeat finding can be used for clustering. Phylogeny methods may or may not use
[Fig. 15.1 The hierarchy of the study areas: sequence analysis and network analysis at the base, clustering and phylogenetics above them, and disease analysis and evolution analysis at the top]
remain to be investigated. The long-term goal of research may be to address the
problems described under the future directions in this chapter, while keeping an eye
on distributed solutions, as the problems encountered always need high computational
power.
Looking at the current challenges, we can anticipate that future challenges and
research in bioinformatics will continue to lie in these topics, but with possible new
orientations toward new subareas. Our expectation is that bioinformatics education
will be more stable than it is at present, but the main research activities in algorithmic
studies in bioinformatics will probably concern the management of big data, which
will grow bigger, and the analysis of diseases and evolution.
Big data will undoubtedly get bigger, which means that high computational power,
parallel/distributed algorithms, tools, and integrated software support will be needed
more than before [16]. We envisage that there will be significant research oriented
towards parallel/distributed algorithms to solve the fundamental bioinformatics
problems, along with the analysis of disease. Data mining techniques will be imperative
in analyzing and extracting meaningful knowledge out of big data, with clustering
being one of the most investigated areas of research in various disciplines such as
computer science, statistics, and bioinformatics. Therefore, we may expect to see
advanced data mining methods such as intelligent clustering tailored for bioinformatics
data. Cloud computing will probably be employed more for the storage, analysis, and
modeling of big data, with more advanced user interfaces and more integrated
development tools.
Proteins, genes, and their interactions have so far been considered the major actors
in the disease states of an organism. We envisage that the study of biological networks
such as PPI networks and GRNs using algorithmic techniques to understand disease
mechanisms will continue in the foreseeable future. However, it has been observed
that small molecules such as amino acids, sugars, and lipids may have more important
effects on the disease states of an organism than expected before [24]. This relatively
new area is called chemical bioinformatics, and it deals with the chemical and
biological processes in the cell. There are a number of databases available related to small
molecules in the cell; the metabolic pathway databases include the Kyoto Encyclopedia
of Genes and Genomes (KEGG) [14], the Reactome database [13], and the Small
Molecule Pathway Database (SMPDB) [5]. It may be foreseen that the analysis of
the role of these small molecules in disease will be needed in the future, along with
biological network analysis.
References
1. Apache Hadoop: http://hadoop.apache.org/
2. Cho D-Y, Kim Y-A, Przytycka TM (2012) PLOS computational biology, translational bioinformatics collection, volume 8, issue 12, chapter 5: network biology approach to complex diseases
3. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th symposium on operating systems design and implementation, 6–8 Dec 2004, San Francisco, California, USA, vol 6. ACM, New York, pp 137–150
4. Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB (2011) Bioinformatics
challenges for personalized medicine. Bioinformatics 27(13):1741–1748
5. Frolkis A, Knox C, Lim E, Jewison T, Law V et al (2010) SMPDB: the small molecule pathway
database. Nucleic Acids Res 38:D480–487
6. Gonzalez MW, Kann MG (2012) PLOS computational biology, translational bioinformatics
collection, volume 8, issue 12, chapter 4: protein interactions and disease
7. https://azure.microsoft.com
8. https://www.amazon.com/clouddrive
9. https://www.icloud.com/
10. https://www.dropbox.com/
11. http://www.fda.gov/scienceresearch/specialtopics/personalizedmedicine/
12. https://drive.google.com/drive
13. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, et al (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 33:D428–432
14. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M et al (2006) From genomics to
chemical genomics: new developments in KEGG. Nucleic Acids Res 34:D354–357
15. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009) Searching for SNPs with cloud
computing. Genome Biol 10(11):R134
16. Marx V (2013) Biology: the big challenges of big data. Nature 498:255–260
17. Merelli I, Perez-Sanchez H, Gesing S, D'Agostino D (2014) Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives. BioMed Res Int 2014, Article ID 134023
18. Olson S, Beachy SH, Giammaria CF, Berger AC (2012) Integrating large-scale genomic information into clinical practice. The National Academies Press, Washington
19. Overby CL, Tarczy-Hornoch P (2013) Personalized medicine: challenges and opportunities for
translational bioinformatics. Per Med 10(5):453–462
20. Ranganathan S (2005) Bioinformatics education-perspectives and challenges. PLoS Comput
Biol 1:e52
21. Thanh TD, Mohan S, Choi E, SangBum K, Kim P (2008) A taxonomy and survey on distributed
file systems. In: Proceedings of the 4th international conference on networked computing and
advanced information management (NCM’08), vol 1, pp 144–149
22. Welch LR, Schwartz R, Lewitter F (2012) A report of the curriculum task force of the ISCB
Education Committee. PLoS Comput Biol 8(6):e1002570
23. Welch L, Lewitter F, Schwartz R et al (2014) Bioinformatics curriculum guidelines: toward a
definition of core competencies. PLoS Comput Biol 10(3):e1003496
24. Wishart DS (2012) PLOS computational biology translational bioinformatics collection, vol-
ume 8, issue 12, chapter 3: small molecules and disease
Index

A
Algorithm, 27, 33
  approximation, 45
  breadth-first-search, 36
  Clark's, 202
  clustering
    Batagelj–Zaversnik, 255
    Bron–Kerbosch, 253
    HCS, 258, 259
    k-core, 254
    Markov clustering, 263
    MCODE, 256
    spectral, 267
  complexity, 33
  depth-first-search, 38
  distributed, 51, 69
  dynamic programming, 35
  expectation maximization, 203
  graph, 36
    breadth-first-search, 36
    depth-first-search, 38
    minimum spanning tree, 39
    shortest path, 39
    special subgraph, 41
  heuristic, 47
  k-means, 140
  k-medoid, 141
  MCL, 264, 266
  minimum spanning tree, 39
  motif discovery
    ESU, 283
    Grochow–Kellis, 286
    Kavosh, 285
    mfinder, 282
    MODA, 288
    Ribeiro, 294
    Schatz, 294
    Wang, 292
  network alignment
    auction, 316
    GRAAL, 310
    Hoepman, 314
    IsoRank, 309
    MaWIsh, 309
    PathBlast, 308
  parallel, 51
    complexity, 55
  phylogenetics
    Fitch's algorithm, 336
    Sankoff's algorithm, 336
  phylogenetic tree
    neighbor joining, 330
    UPGMA, 327
  recurrence, 34
  shortest path, 39
  Stoye–Gusfield, 164
  Yang and Lonardi, 251
Alignment, 303
  sequence, 111

B
Betweenness
  edge, 226
  vertex, 225
Bioinformatics
  challenges
    big data, 352
    disease analysis, 353
    education, 355
    network analysis, 357
    sequence analysis, 356
  future directions, 360
Biological network, 3, 213
  brain functional, 219
  cell, 214
  gene regulation, 215

H
Haplotype, 200
Haplotype inference, 200
  Clark's algorithm, 202
  distributed, 204
    Clark's algorithm, 204
    expectation maximization, 205
  EM algorithm, 203
Hidden Markov Models, 186
Human genome project, 23

I
Independent set, 41
Interconnection network, 52

L
Local sequence alignment, 118
Longest common increasing subsequence, 95
Longest common subsequence, 92
Longest increasing subsequence, 95
Longest subsequence, 92

M
Matching, 42
Matching index, 223
Metabolic network, 214
Modularity, 245
Molecular biology, 11
  cell, 12
  cloning, 20
  DNA, 13
  gene, 15
  genetic code, 18
  human genome project, 23
  mutation, 19
  polymerase chain reaction, 20
  protein, 15
  RNA, 14
  sequencing, 21
  transcription, 17
  translation, 18
Motif
  network, 275
  sequence, 166
Multicomputer, 53
Multiple sequence alignment, 120
  center star method, 121
  progressive, 122
    CLUSTALW, 122
Multiprocessor, 53
Mutation, 19

N
Network
  motif, 275
Network alignment, 235, 303
  algorithm
    auction, 316
    GRAAL, 310
    Hoepman, 314
    IsoRank, 309
    MaWIsh, 309
    PathBlast, 308
  distributed, 311
  methods, 307
Network motif, 235, 275
  algorithm
    distributed, 291
    ESU, 283
    Grochow–Kellis, 286
    Kavosh, 285
    mfinder, 282
    MODA, 288
    RAND-ESU, 284
    sampling MODA, 290
  frequency, 279
  network centric, 282
  random graph, 280
  statistical significance, 280
Neural networks
  artificial, 188
NP-completeness, 43
  approximation algorithm, 45
  heuristic, 47
  reduction, 44

O
Oriented pair, 197

P
Pairwise sequence alignment, 115
Parallel algorithm
  design, 57
Parallel computing, 51, 54
  algorithm complexity, 55
  architecture, 52
    interconnection network, 52
    multicomputer, 53
    multiprocessor, 53
  Flynn's taxonomy, 54
  multi-threaded, 63
    POSIX threads, 64
  parallel random access memory, 55

T
Tandem repeat, 161
  distributed, 166
Transcription, 17
Translation, 18

U
UNIX, 66
  distributed, 73