Pattern Discovery
in Bioinformatics
Theory & Algorithms
CHAPMAN & HALL/CRC
Mathematical and Computational Biology Series
Series Editors
Alison M. Etheridge
Department of Statistics
University of Oxford
Louis J. Gross
Department of Ecology and Evolutionary Biology
University of Tennessee
Suzanne Lenhart
Department of Mathematics
University of Tennessee
Philip K. Maini
Mathematical Institute
University of Oxford
Shoba Ranganathan
Research Institute of Biotechnology
Macquarie University
Hershel M. Safer
Weizmann Institute of Science
Bioinformatics & Bio Computing
Eberhard O. Voit
The Wallace H. Coulter Department of Biomedical Engineering
Georgia Tech and Emory University
Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
24-25 Blades Court
Deodar Road
London SW15 2NU
UK
Pattern Discovery
in Bioinformatics
Theory & Algorithms
Laxmi Parida
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.
No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any
electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Parida, Laxmi.
Pattern discovery in bioinformatics / Laxmi Parida.
p. ; cm. ‑‑ (Chapman & Hall/CRC mathematical and computational biology
series)
Includes bibliographical references and index.
ISBN‑13: 978‑1‑58488‑549‑8 (alk. paper)
ISBN‑10: 1‑58488‑549‑1 (alk. paper)
1. Bioinformatics. 2. Pattern recognition systems. I. Title. II. Series: Chapman
and Hall/CRC mathematical & computational biology series.
[DNLM: 1. Computational Biology‑‑methods. 2. Pattern Recognition,
Automated. QU 26.5 P231p 2008]
QH324.2.P373 2008
572.80285‑‑dc22 2007014582
Contents

1 Introduction
  1.1 Ubiquity of Patterns
  1.2 Motivations from Biology
  1.3 The Need for Rigor
  1.4 Who is a Reader of this Book?
      1.4.1 About this book

I The Fundamentals

2 Basic Algorithmics
  2.1 Introduction
  2.2 Graphs
  2.3 Tree Problem 1: Minimum Spanning Tree
      2.3.1 Prim's algorithm
  2.4 Tree Problem 2: Steiner Tree
  2.5 Tree Problem 3: Minimum Mutation Labeling
      2.5.1 Fitch's algorithm
  2.6 Storing & Retrieving Elements
  2.7 Asymptotic Functions
  2.8 Recurrence Equations
      2.8.1 Counting binary trees
      2.8.2 Enumerating unrooted trees (Prüfer sequence)
  2.9 NP-Complete Class of Problems
  2.10 Exercises

3 Basic Statistics
  3.1 Introduction
  3.2 Basic Probability
      3.2.1 Probability space foundations
      3.2.2 Multiple events (Bayes' theorem)
      3.2.3 Inclusion-exclusion principle
      3.2.4 Discrete probability space
      3.2.5 Algebra of random variables
      3.2.6 Expectations
      3.2.7 Discrete probability distribution (binomial, Poisson)
      3.2.8 Continuous probability distribution (normal)
      3.2.9 Continuous probability space (Ω is R)
  3.3 The Bare Truth about Inferential Statistics
      3.3.1 Probability distribution invariants
      3.3.2 Samples & summary statistics
      3.3.3 The central limit theorem
      3.3.4 Statistical significance (p-value)
  3.4 Summary
  3.5 Exercises

References
Index
Acknowledgments
I owe the completion of this book to the patience and understanding of Tuhina
at home and of friends and colleagues outside of home. I am particularly
grateful for Tuhina’s subtle, quiet cheer-leading without which this effort may
have seemed like a thankless chore.
Behind every woman is an army of men. My sincere thanks to Alberto Apostolico, Saugata Basu, Jaume Bertranpetit, Andrea Califano, Matteo Comin, David Gilbert, Danny Hermelin, Enam Karim, Gadi Landau, Naren Ramakrishnan, Ajay Royyuru, David Sankoff, Frank Suits, Maciej Trybilo, Steve Oshry, Samaresh Parida, Rohit Parikh, Mike Waterman and Oren Weimann for their, sometimes unwitting, complicity in this endeavor.
Chapter 1
Introduction
the domain specifications, but a tacit acknowledgment that each domain deserves much closer attention and elaborate treatment that goes well beyond the scope of this book. Also, this approach compels us to take a hard look at the problem domain, often giving rise to elegant and clean definitions with a sound mathematical foundation as well as efficient algorithms.
Challenging exercises (and sections) are marked with ∗∗.
The Fundamentals
Chapter 2
Basic Algorithmics
2.1 Introduction
To keep the book self-contained, in this chapter we review the basic mathematics and algorithmics that are required to understand and appreciate the material in the remainder of the book. To give a context to some of the abstract ideas involved, we follow a storyline. This is also an exercise in understanding an application, abstracting the essence of the task at hand, formulating the computational problem, and designing and analyzing an algorithm for solving the computational problem. The last (but certainly not the least) part of the task is to implement the algorithm and analyze its performance in a real world setting.
Consider the following scenario. Professor Marwin has come across a gem of a bacterium: a chromo-bacterium that is identified by its color; every time a mutation occurs, it takes on a new color. Leaving her bacteria culture with a generous source of nutrition, she proceeds on a two week vacation. On her return, she is confronted with K distinct colored strains of bacteria. A closer look reveals that each bacterium has a genome of size m, with only mutational differences between the genomes. She is intrigued by this and is eager to reconstruct the evolution of the bacteria in her absence.
2.2 Graphs
A graph is a common abstraction that is used to model binary relationships
between objects. More precisely, a graph G is a pair (V, E), consisting of a
set V and a subset
E ⊂ (V × V )
Consider a weighted graph G(V, E) with |V| = K. The task is to find a connected graph G∗ with minimum weight (over all possible connected graphs on V).
FIGURE 2.1: The graph constructed with the data from Example (1). A vertex vi is labeled with sequence si, 1 ≤ i ≤ 5. The weight associated with edge (vi, vj) is the distance dij.
The edge weight dij denotes the distance between two genomes si , sj , or the
number of mutations that takes genome si to sj or vice-versa. Clearly, all
the K colored groups of bacteria that Professor Marwin observes did not
evolve independently but from the one strain that she started with. Thus
the connectivity of the graph G, where each node is a genome, represents
how the genomes evolved over a period of time. Given different choices of
graphs that explain the evolution process, the one(s) of most interest would
be the ones that have the smallest weight since that would be considered as
the simplest explanation of the process. Consider an instance of this problem
in the following example.

Example (1):
s1 = CGACTCGCAT
s2 = CGACCCGCCT
s3 = CCATCCGCAC
s4 = CCATTCCCAT
s5 = CCAGCCCCAT
A subgraph G′(V′, E′) is induced by V′ when, given V′ ⊆ V, E′ is defined as
E′ = {(v1 v2) ∈ E | v1, v2 ∈ V′}.
A subgraph G′(V′, E′) is induced by E′ when, given E′ ⊆ E, V′ is defined as
V′ = {vi ∈ V | (vi, vj) ∈ E′}.
A path between two vertices v1, v2 ∈ V is a sequence of vertices, beginning at v1 and ending at v2, such that every two consecutive vertices in the sequence are joined by an edge of E.
Consider first the graph with no edges, i.e.,
E = ∅.
Then
wt(G) = 0,
and this is the smallest possible weight of a graph. But the graph is not connected, and hence this is an incorrect solution. However, this failure suggests that the solution needs to introduce as small a number of edges as possible so as to make the graph connected. Seeking the smallest (or simplest) explanation for a problem is often called the Occam's razor principle or the principle of parsimony.
Another way to look at the problem is to start with a completely connected graph
G(V, E),
i.e., each
(vi vj) ∈ E where i ≠ j,
and remove the redundant edges. Some careful thought leads to the conclusion that this graph must have no cycles, i.e., no closed paths. This is because an edge can be removed from a cycle, maintaining the connectivity of the graph but reducing the number of edges. These observations lead to the following definition.
A connected graph G(V, E) with no cycles is called a tree. A vertex with only one incident edge is a leaf node; vertices v1, v2 ∈ V with more than one incident edge are internal nodes.

Does every tree have internal nodes? Consider a tree with only two vertices: all the vertices have only one incident edge, hence this tree has no internal node. Does every tree have a leaf node? The following lemma guarantees that it does.
LEMMA 2.1
(mandatory leaf lemma) A tree T (V, E) must have a leaf node.
PROOF If
|V| ≤ 2,
clearly all the nodes are leaves. Assume
|V| > 2
and let
V = {v1, v2, v3, . . . , vn}.
Consider an arbitrary
vi0 ∈ V.
If vi0 is a leaf node, we are done. Assume it is not, and consider vi1 where
(vi0 vi1) ∈ E.
Again, if vi1 is not a leaf node, since vi1 has degree at least two, there exists
vi2 ≠ vi0
with
(vi1 vi2) ∈ E.
Since the graph is finite and has no cycles,1 this process must terminate with a leaf node.
1 Consider the sequence of vertices in the traversal: vi0 vi1 vi2 vi3 . . . vik. If a vertex appears multiple times in the traversal, then clearly the graph has a cycle.

The distance dij between two sequences si and sj can be computed by (1) initializing a count to 0, (2) reading the corresponding characters of the two sequences, (3) comparing them, and (4) incrementing the count whenever they differ. If ca, cr and cc denote the costs of a single addition, read and comparison, respectively, the total time is a function of m, and we study the growth of this function with m. To avoid clutter and focus on the most meaningful factor, we use an asymptotic notation (see Section 2.7) and the time is written as
ca + m(2cr + cc + ca) = O(m)
We use a big-oh notation O(·) which is explained in Section 2.7. In other
words, all the constants are ignored. The time grows linearly with the size m
and that is all that matters.
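For the programming-inclined reader, this character-by-character count is a one-line computation. The following Python sketch is ours, purely for illustration (the function name is not from the text):

def distance(s1: str, s2: str) -> int:
    """d_ij: number of base differences between two equal-length
    genomes, counted in a single O(m) scan."""
    assert len(s1) == len(s2)
    return sum(1 for a, b in zip(s1, s2) if a != b)

# e.g., the sequences s1 and s2 of Example (1):
print(distance("CGACTCGCAT", "CGACCCGCCT"))   # -> 2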
Since there are K distinct sequences,
K(K − 1)/2
pairwise comparisons are made. Again using the asymptotic notation, ignoring linear factors (K) in favor of faster growing quadratic factors (K^2), the number of comparisons is written as
O(K^2).
Thus, since O(K^2) comparisons are made and each comparison takes O(m) time, the total time taken is
O(K^2 m).
LEMMA 2.2
(Edge-vertex lemma) Given a connected graph G(V, E),
(G is a tree) ⇒ |E| = |V | − 1.
COROLLARY 2.1
Given a tree T (V, E), removing an edge (keeping V untouched) disconnects
the tree.
Back to the algorithm. Since this algorithm involves searching the entire space of spanning trees, we next count the total number N of such trees on K vertices. This number N is given by Cayley's formula:
N = K^(K−2).
However, this is not the end of the story, since we must devise a way to enumerate these N configurations. This is discussed in Section 2.8.2. To summarize, the time taken by the naive algorithm is
O(K^K + mK^2).
Given a graph G(V, E), a spanning tree T(V′, E′) is a subgraph such that
1. V′ = V,
2. E′ ⊆ E, and
3. T is a tree.
LEMMA 2.3
(Bridge lemma) Given a graph G(V, E), let
V = V1 ∪ V2 with V1 ∩ V2 = ∅
be a partition of the vertices, and let
Es ⊆ E
be the set of edges with one end point in V1 and the other in V2, so that keeping only
(E \ Es)
disconnects V1 from V2. If (v1 v2) is an edge of minimum weight in Es, then
(v1 v2) ∈ (Es ∩ E∗),
where
T∗(V, E∗)
is a minimum spanning tree.

PROOF Assume the result is not true, i.e.,
(v1 v2) ∉ E∗.
Since T∗ is connected, the path in T∗ between v1 and v2 must use some edge
(v1′ v2′) ∈ Es
with
wt(v1′ v2′) > wt(v1 v2).
Construct T′ from T∗ by deleting
(v1′ v2′).
Next we add the edge (v1 v2) to T′, which now connects the subgraphs T1′ and T2′ without introducing cycles, since v1 ∈ V1 and v2 ∈ V2, without loss of generality. As a result T′ is acyclic and connected, hence a tree. But then wt(T′) < wt(T∗), contradicting the minimality of T∗. Hence
(v1 v2) ∈ E∗.
LEMMA 2.4
(Weakest link lemma) The converse of Lemma (2.3) also holds true. In other words, given a minimum spanning tree
T∗(V, E∗),
let
(v1 v2) ∈ E∗,
and let the two connected components of T∗ obtained by deleting the edge (v1 v2) be
Tk∗(Vk∗, Ek∗), k = 1, 2.
Let Gk(Vk, Ek) be the two subgraphs of G(V, E) induced by Ek∗, with
v1 ∈ V1 and v2 ∈ V2.
Then
wt(v1 v2) = min { wt(vi vj) | (vi vj) ∈ E, vi ∈ V1, vj ∈ V2 }.
From lemma to algorithm. This lemma and its converse can be used to design a straightforward algorithm (Algorithm (1)): we progressively construct an Es in every step as we build E∗ and V∗. This algorithm is also called Prim's algorithm [CLR90]. It is now straightforward to prove the correctness of the algorithm.
LEMMA 2.5
Algorithm (1) correctly computes the minimum spanning tree.
PROOF At each iteration,
Es ≠ ∅
since
|V∗| < |V|,
and exactly one edge is added to E∗. By Lemma (2.2), T∗ has
|V| − 1
edges, so the algorithm is iterated (|V| − 1) times. By Lemma (2.3), each edge added to E∗ is in the minimum spanning tree; hence the algorithm is correct.
Figure 2.2 illustrates the algorithm on the graph of Example (1). Using the asymptotic notation, Step (0) takes time O(1). Step (1) takes O(|E|) time, Step (2) takes O(1) and Step (3) takes O(|E|) time. Since Steps (1)-(3) are repeated (|V| − 1) times, the algorithm takes time
O(|V||E|).
The running time complexity can be improved by using efficient data structures for Step (3) of the algorithm. However, for this exposition we stay content with the time complexity of
O(|V||E|).
Algorithm (1):
(0) Pick an arbitrary v ∈ V; V∗ ← {v}; E∗ ← ∅; Es ← {(v, vj) ∈ E}
FOR i = 1 to (|V| − 1)
  (1) Let wt(v1 v2) = min over (vi vj) ∈ Es of wt(vi vj)
  (2) V∗ ← V∗ ∪ {v1, v2}; E∗ ← E∗ ∪ {(v1 v2)}
  (3) Es ← {(vi vj) ∈ E | (vi ∈ V∗) AND (vj ∉ V∗)}
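For concreteness, here is a minimal Python sketch of Algorithm (1) on a complete weighted graph; the names (prim_mst, wt) are ours, and the edge-weight table reuses the distance() sketch from the previous section:

def prim_mst(vertices: set, wt: dict) -> list:
    """Sketch of Algorithm (1): wt maps frozenset({u, v}) to the edge
    weight. Returns the |V| - 1 edges of a minimum spanning tree."""
    v0 = next(iter(vertices))
    v_star = {v0}        # V*: vertices already in the tree
    e_star = []          # E*: chosen tree edges
    while len(v_star) < len(vertices):
        # Steps (1)-(3): cheapest edge in E_s, the cut edges leaving V*
        u, v = min(((a, b) for a in v_star for b in vertices - v_star),
                   key=lambda e: wt[frozenset(e)])
        v_star.add(v)
        e_star.append((u, v))
    return e_star

# Usage with the distances of Example (1):
seqs = {"s1": "CGACTCGCAT", "s2": "CGACCCGCCT", "s3": "CCATCCGCAC",
        "s4": "CCATTCCCAT", "s5": "CCAGCCCCAT"}
wt = {frozenset({a, b}): distance(seqs[a], seqs[b])
      for a in seqs for b in seqs if a != b}
print(prim_mst(set(seqs), wt))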
FIGURE 2.2: Consider the graph of Figure 2.1. The minimum spanning
tree (MST) algorithm on this graph is shown here. (1)-(4) denote the steps
in the algorithm. At each step the edges shown in bold are the ones that
constitute the spanning tree and the collection of edges Es is shown as dashed
edges. The MST is shown by the bold edges in (4).
2.4 Tree Problem 2: Steiner Tree

Given K sequences
si, 1 ≤ i ≤ K,
each of size m, let dij be the number of base differences between sequences si and sj. Consider a weighted graph
G(V, E)
with
|V| ≥ K
and edge weights
wt(vi vj) = dij where vi, vj ∈ V.
The weight of G(V, E) is defined as:
WG = Σ over (vi vj) ∈ E of wt(vi vj).
The task is to find a tree
T∗(V, E)
of minimum weight with the K given sequences as its leaf nodes.
FIGURE 2.3: The tree of minimum weight with the given vertices as leaf
nodes labeled as s1 to s5 .
Let Tb(K) denote the number of binary (bifurcating) trees with the K sequences as leaf nodes, and Ta(K) the number of all such trees; then
Tb(K) ≤ Ta(K).

2 In a rooted tree, the only exception is the root, which has no parent.
LEMMA 2.6
(Two-tree partition) Let
T(V, E)
be a tree with subtrees T1(V1, E1) and T2(V2, E2) such that
V1 ∩ V2 = ∅,
and
V = V1 ∪ V2 ∪ {v0} where v0 ∉ V1, V2.
Further
E = E1 ∪ E2 ∪ {(v0 v1), (v0 v2)},
for fixed
vi ∈ Vi, i = 1, 2.
Let the minimal weight of a tree T′ be given as
Wopt(T′),
and let L(v) denote the set of labels that achieve an optimal weight at node v. Then
Wopt(T) = Wopt(T1) + Wopt(T2), if L(v1) ∩ L(v2) ≠ ∅;
Wopt(T) = Wopt(T1) + Wopt(T2) + 1, otherwise.

3 Strictly speaking, Fitch's algorithm was presented for a rooted bifurcating (binary) tree. We have generalized the principle here.
PROOF We are given that the labelings of T1 and T2 are optimal. Clearly the following is not possible:
Wopt(T) < Wopt(T1) + Wopt(T2).
If
L(v1) ∩ L(v2) ≠ ∅,
then by the labeling
L(v0) ← L(v1) ∩ L(v2),
we get
Wopt(T) = Wopt(T1) + Wopt(T2),
which is clearly optimal, since the weight cannot be improved (reduced) any further.
If
L(v1) ∩ L(v2) = ∅,
then by the labeling
L(v0) ← L(v1) ∪ L(v2),
we get
Wopt(T) = Wopt(T1) + Wopt(T2) + 1.
Again this is clearly optimal, since if it were not, there would exist a labeling of T1, say, such that the new weight is less than the given weight, which is a contradiction.
The following lemma is a more general form of Lemma (2.6), in the sense that the number of partitioning subtrees can be larger than 2.
LEMMA 2.7
(Multi-tree partition) Let
T(V, E)
be a tree with subtrees
T(Vi, Ei), 1 ≤ i ≤ p,
such that
Vi ∩ Vj = ∅, for 1 ≤ (i ≠ j) ≤ p,
and V = (∪i Vi) ∪ {v0} with v0 ∉ Vi for all i. Also,
E = (∪i Ei) ∪ (∪i {(v0 vi)}),
for fixed vi ∈ Vi.
The arguments for this lemma are along the lines of that of Lemma (2.6)
and we skip the details here as they give no further insight into the problem
than we already have.
From lemma to algorithm. Do the lemmas give an indication of the algorithm that can solve the problem? Actually, they do. This is a classic case of obtaining the optimal solution for a problem using optimal solutions to the subproblems. The task is to break the given tree into subtrees which can be labeled optimally and then build from there. The starting point is the collection of single-node trees, the leaf nodes, which are optimally assigned by the given problem.
However, there is one catch. While we solve the subproblems, it is important to keep track of all the possible labelings of the roots vi of the subtrees Ti. The lemmas state that W(T) is optimal, but what is the guarantee that there is no other labeling of the nodes that gives the optimal solution? For example, a suboptimal labeling of the internal nodes of T1, with a different label at its root, could give the optimal labeling for T. Let the suboptimal labeling be denoted by L′. Then
L′(v1) ∩ L(v1) = ∅
and at least one other vertex
v1′ (≠ v1) ∈ V1
is such that
L′(v1′) ≠ L(v1′).
However, the labeling of v1′ ∈ V1 does not matter. This is because T1 is a subtree and the only node that connects it to the remainder of the tree is v1; hence v1′ will never be considered in the future either. This implies that the algorithm
1. computes the optimal weight W(T), and
2. gives some optimal labelings but not all optimal labelings of the internal nodes.
Each leaf node v is assigned depth
Depth(v) ← 0,
and each nonleaf node v is assigned a depth Depth(v) as the shortest path length to any leaf node. Let maxdepth be the maximum depth assigned to any vertex in the tree.

FOR d = 1, 2, . . . , maxdepth DO
  FOR EACH v ∈ V with Depth(v) = d
    U ← {u | (u v) ∈ E, Depth(u) < d}
    v(σ) ← {u | (u ∈ U) AND (σ ∈ L(u))}
    L(v) ← {σ′ | |v(σ′)| = max over σ ∈ Σ of |v(σ)|}
    Wt ← Wt + (|U| − max over σ ∈ Σ of |v(σ)|)
  ENDFOR
ENDFOR
This algorithm is simple but not necessarily the most efficient. The optimal
weight is computed in the variable W t. The depth of the nodes need not
be explicitly computed and the tree can be traversed bottom-up from the
leaves. For rooted bifurcating trees, this is also known as Fitch’s algorithm.
Figure 2.4 gives an example illustrating this simple and elegant algorithm.
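To make the bottom-up computation concrete, the following Python sketch (ours, with illustrative names; one character position at a time) mirrors the algorithm above on a rooted tree:

from collections import Counter

def min_mutation_labeling(tree: dict, leaf_label: dict):
    """tree maps each internal node to its list of children; leaf_label
    maps each leaf to its character. Returns (optimal weight Wt, label
    sets L), traversing bottom-up from the leaves."""
    L, wt = {}, 0
    def visit(v):
        nonlocal wt
        if v not in tree:                       # leaf node: depth 0
            L[v] = {leaf_label[v]}
            return
        for u in tree[v]:
            visit(u)
        counts = Counter(s for u in tree[v] for s in L[u])
        best = max(counts.values())
        L[v] = {s for s, c in counts.items() if c == best}
        wt += len(tree[v]) - best               # children that cannot agree
    children = {u for cs in tree.values() for u in cs}
    root = next(v for v in tree if v not in children)
    visit(root)
    return wt, L

# A tiny rooted example: leaves labeled G and C force one mutation.
print(min_mutation_labeling({"r": ["x", "y"]}, {"x": "G", "y": "C"}))
# -> weight 1, with L(r) = {'C', 'G'}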
FIGURE 2.4: Panels (a)-(d): the labeling algorithm illustrated on an example tree with leaf labels C and G.
2.6 Storing & Retrieving Elements

FIGURE 2.5: Doubly linked linear list: each element in the linked list has a pointer to the previous and the next element. Panel (1) shows the list; in panel (2), the element shown in the dashed box is a new element that is added in lexicographic order. This takes O(n) time.
FIGURE 2.6: The elements stored in a balanced binary tree of height log n. Panel (1) shows a balanced binary tree; in panel (2), the element shown in a dashed box is being added to the tree, maintaining its balanced property. This takes O(log n) time.
then the element is possibly in the second collection; otherwise it is possibly in the first. We repeat this process until either ei is found or we run out of subcollections to search. The time taken for this process can be computed using the following recurrence equation:
T(n) = 1, if n = 1;
T(n) = T(n/2) + O(1), if n > 1.
Asymptotically,
T(n) = O(log n).
See Section 2.7 and Exercise 7.
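The halving search itself is only a few lines; a minimal Python sketch (ours):

def search(sorted_elems: list, target) -> int:
    """Repeated halving: T(n) = T(n/2) + O(1) = O(log n) comparisons."""
    lo, hi = 0, len(sorted_elems) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_elems[mid] == target:
            return mid                 # found at index mid
        if sorted_elems[mid] < target:
            lo = mid + 1               # target lies in the second half
        else:
            hi = mid - 1               # target lies in the first half
    return -1                          # we ran out of subcollections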
Next, Professor Marwin realizes that a new sequence needs to be added very often. It seems like a waste of effort to redo the whole computation every time. The problem is modified as follows.
For a node p, let Hleft(p) denote the height of the tree rooted at its left child and Hright(p) the height of the tree rooted at its right child. A tree is balanced if for every node p,
|Hright(p) − Hleft(p)| ≤ 1.
In other words, a difference of at most one is allowed in the height of the left and the right subtrees of a node. In Figure 2.6, the root r (the vertex at the top) has
Hright(r) = Hleft(r) = 2.
Every other node has a difference of one in the heights of its left and right
child.
Thus when a new element is added, the tree has to remain balanced. The
height of the tree is
O(log n).
In the example in Figure 2.6, the element is added in the correct lexicographic
order and the tree continues to be balanced. In general, it is possible to make
some local adjustments so that the tree is balanced. This balancing can be
done in time
O(log n).
Thus inserting an element in a balanced binary tree takes time
O(log n).
Examples of such data structures are AVL trees (named after the authors Adelson-Velskii and Landis), 2-3 trees, B-trees, red-black trees and splay trees [CLR90].
2.7 Asymptotic Functions

Consider the number
g = 205.42332122132122.
Let an acceptable approximation be f with only two digits after the decimal and4
|g − f| ≤ 0.005.
The feature here is apparently the usability with acceptable monetary units. Denote this 'approximation' as Υ; then this can be written as
g = Υ(f).

4 If g was obtained as the interest computed for thirteen months on a sum of money and had to be paid to a client, then f = 205.42 is an acceptable approximation.
in studying the growth of this function with n; this is called the asymptotic behavior of the function
g(n).
In the context of algorithm analysis, various approximations are of interest, some of which are listed in Figure 2.7. We show five different forms of Υ: O(f(n)), o(f(n)), Ω(f(n)), ω(f(n)) and Θ(f(n)). Of these, O(f(n)) is the most commonly used asymptotic notation and we take a closer look at its definition. O(f(n)) is the set of functions g(n) such that there exist (∃) positive constants c and n0 satisfying
0 ≤ g(n) ≤ cf(n), for all n ≥ n0.
For example, entry (e) of Figure 2.7 defines Θ(f(n)) by: ∃ c0, c1, n0 such that 0 ≤ c0 f(n) ≤ g(n) ≤ c1 f(n) for all n ≥ n0; thus, for instance, 10^6 n + 10^10 = Θ(n).
For more intricate instances, refer to the definitions in Figure 2.7. A pictorial representation of some of the functions is shown in Figure 2.8 for convenience.
2.8 Recurrence Equations

Consider the factorial function Fac(n) = n!, the product of the first n positive integers
1, 2, 3, . . . , n.
FIGURE 2.8: Pictorial representations of g(n) = O(f(n)) (g eventually bounded above by cf(n)), g(n) = Ω(f(n)) (bounded below by cf(n)) and g(n) = Θ(f(n)) (sandwiched between c1 f(n) and c2 f(n)), for all n ≥ n0.
Consider the Fibonacci numbers
0, 1, 1, 2, 3, 5, 8, 13, 21, . . . .
The nth Fibonacci number, Fib(n), is given by the recurrence
Fib(0) = 0, Fib(1) = 1, Fib(n) = Fib(n − 1) + Fib(n − 2), for n > 1.
A recurrence form of this kind serves as a concise way of defining the function. But a closed form is more useful for algebraic manipulations and comparison with other forms. The closed form expression for the Fibonacci number is:
Fib(n) = (1/√5) [ ((1 + √5)/2)^n − ((1 − √5)/2)^n ].
Thus
Fib(n) = O(c^n),
where
c = (1 + √5)/2.
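The recurrence and the closed form can be checked against each other; a small Python sketch (ours):

from math import sqrt

def fib(n: int) -> int:
    """Fib(n) via the recurrence Fib(n) = Fib(n-1) + Fib(n-2)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_closed(n: int) -> int:
    """Fib(n) via the closed form; c = (1 + sqrt(5))/2 dominates,
    which is why Fib(n) = O(c^n)."""
    c = (1 + sqrt(5)) / 2
    return round((c ** n - ((1 - sqrt(5)) / 2) ** n) / sqrt(5))

assert all(fib(n) == fib_closed(n) for n in range(40))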
2.8.1 Counting binary trees

Problem 7 (Counting binary trees) Let the leaf nodes of a binary tree be labeled
1, 2, . . . , K.
How many such trees can be constructed?
Let
Tb(K)
denote the number of binary trees with K leaf nodes. We first compute
E(K),
the number of edges in such a tree. Clearly,
E(2) = 1.
Given a binary tree with K leaf nodes, if a new leaf node is to be added to the tree, the new leaf node, together with its edge, always has to be attached to the middle of an existing edge. Thus the existing edge is lost and three new edges are added, effectively increasing the number of edges by 2. Thus, for K ≥ 2,
E(K + 1) = E(K) + 2.
2.8.2 Enumerating unrooted trees (Prüfer sequence)

Consider a tree with K vertices labeled
1, 2, . . . , K.
Repeatedly remove the leaf node with the smallest label, recording the label of its (unique) neighbor. This process is continued until only two vertices remain on the tree. The sequence of labels
p1 p2 . . . pK−3 pK−2
is called the Prüfer sequence. Notice that repetitions are allowed in the sequence, i.e., it is possible that for some
1 ≤ i, j ≤ K − 2,
pi = pj.
Two example trees on the vertex set {1, 2, . . . , 6}: one with Prüfer sequence p = 1114 (vertex 1 appears three times in p and has degree four) and one with p = 2346 (a path).
Algorithm (3): Given a Prüfer sequence p of length l = K − 2 on the labels {1, . . . , K}, initialize E ← ∅ and L ← the set of labels not appearing in p; here Π(·) denotes the set of distinct labels in a sequence.
FOR i = 1, 2, . . . , l DO
  (1) m ← min L
  (2) Introduce edge (m, p[i]) in E
  (3) L ← L \ {m}
  (4) If p[i] ∉ Π(p[i + 1 . . . l]), L ← L ∪ {p[i]}
ENDFOR
(5) Introduce edge (m1, m2) in E where m1, m2 ∈ L
At the end of the iterations exactly two labels remain, corresponding to the two vertices in L, and these get connected in the last step. Thus the constructed graph is connected.
Further, an edge is constructed in each iteration and the number of iterations is K − 1; hence the graph has K − 1 edges. A connected graph with K − 1 edges on K nodes is a tree (see Exercise 2). This proves that the algorithm is correct.
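A Python sketch (ours) of Algorithm (3):

def prufer_to_tree(p: list, K: int) -> list:
    """Reconstruct the unique tree on vertices {1,...,K} from a Prufer
    sequence p of length K - 2 (steps (1)-(5) of the algorithm)."""
    E = []
    L = set(range(1, K + 1)) - set(p)   # labels absent from p are leaves
    for i, label in enumerate(p):
        m = min(L)                      # (1)
        E.append((m, label))            # (2)
        L.remove(m)                     # (3)
        if label not in p[i + 1:]:      # (4)
            L.add(label)
    E.append(tuple(sorted(L)))          # (5): join the last two vertices
    return E

print(prufer_to_tree([1, 1, 1, 4], 6))
# -> [(2, 1), (3, 1), (5, 1), (1, 4), (4, 6)]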
The theorem below follows directly from Algorithm (3) and its proof of
correctness.
THEOREM 2.1
(Prüfer's theorem) There is a bijection from the set of Prüfer sequences of length
of length
K −2
on
V = {1, 2, . . . , K}
to the set of trees T with vertex set V . In other words, a Prüfer sequence
corresponds to a unique tree and vice-versa.
COROLLARY 2.2
The number of trees on K nodes is (Cayley’s number):
K K−2 .
The construction of the tree from the Prüfer sequence p = 111484 on the vertex set {1, 2, . . . , 8} (Algorithm (3)): the successive steps (a)-(g) consume p while updating the leaf label set L, from p = 111484, L = {2, 3, 5, 6, 7} at the start, through L = {3, 5, 6, 7}, {5, 6, 7}, {1, 6, 7}, {6, 7} and {7, 8}, down to p = ∅, L = {4, 8}, at which point the last two vertices are joined.
2.9 NP-Complete Class of Problems

A problem that can be solved in time
O(n^c)
is called tractable, where n is the size of the input and c is some constant. So, is the Steiner tree problem really not tractable, or were we not smart enough to find one?
Theoretical computer scientists study an interesting class of problems, called the NP-complete6 problems, whose tractability status is unknown. No polynomial time algorithm has been discovered for any problem in this class, to date. However, the general belief is that the problems in this class are intractable. Needless to mention, this is the most perplexing open problem in the area.
Notwithstanding the fact that the central problem in theoretical computer
science remains unresolved, techniques have been devised to ascertain the
tractability of a problem using relationships between problems in this class.
Suppose we encounter Problem X. First we need to check if this problem
has been studied before. A growing compendium of problems in the class of NP-complete problems exists, and it is very likely that a new problem one encounters is identical to one of this collection. For example, our Problem (3)
was identified to be the Steiner tree problem.
If Problem X cannot be identified with a known problem, then the next step
is to reduce a problem, Problem A, from the NP-complete class in polynomial
time to Problem X. The reduction is a precise process that demonstrates
that a solution to Problem A can be obtained from a solution to Problem X
in polynomial time. This proves that Problem X also belongs to the class of
NP-complete problems.
6 NP stands for 'Non-deterministic Polynomial'; any further discussion requires a fair understanding of formal language theory and is beyond the scope of this exposition.
Summary
The chapter introduces the reader to a very basic abstraction, trees. The
link between pattern discovery and trees will become obvious in the later
chapters. To understand and appreciate the issues involved with trees, we
elaborate on three problems: (1) Minimum spanning tree, (2) Steiner tree and
(3) Minimum mutation labeling. The first and the third have a polynomial
time solution. The reader is also given a quick introduction to using the same
abstraction as a data structure (balanced binary trees). Recurrence equation
is a simple yet powerful tool that helps in counting and Prüfer sequences are
invaluable in enumerating trees.
This was a brief introduction to the exciting world of algorithmics. In my
mind, this is the foundation for a systematic subject such as bioinformatics.
The beauty of this field is that some very powerful statements (that will be
used repeatedly elsewhere in the book), such as the number of internal nodes
in a tree is bounded by the number of leaf nodes, are consequences of very
simple ideas (see Exercise 4). The intent of the chapter has been to introduce
the reader to basic concepts used elsewhere in the book as well as influence
his or her thought processes while dealing with computational approaches to
biological problems.
2.10 Exercises
Exercise 1 (mtDNA) DNA in the mitochondria (mtDNA) of a cell traces the lineage of mother to child. A health center that gathers mtDNA data for families to help trace and understand hereditary diseases accidentally mixes up the mtDNA data, losing all lineage information for a family with seven generations. The entire sequence information of the mtDNA for each member is accessible from a database.

7 Though in theory it is possible, since the central tractability question is still unresolved, in practice it is extremely unlikely.
Exercise 2 Show that, for a connected graph G(V, E),
(G is a tree) ⇐ (|E| = |V| − 1).
Exercise 3 (Tree leaf nodes) Show that a tree T(V, E) with |V| > 1 must have at least l = 2 leaf nodes.
Does the statement hold for l = 3 (assume |V| > 2)?
Hint: Fix a vertex vf ∈ V and consider every other vertex
(v ≠ vf) ∈ V.
Note that a v with the largest distance from vf must be a leaf node.
Exercise 4 (Linear size of trees) Given a tree T (V, E), let l be the number
of leaf nodes. Show that
|V | ≤ 2l.
Exercise 6 Show that Equation (2.2) is the solution to the recurrence equation:
F(K) = 1, if K = 2;
F(K) = (2K − 5) F(K − 1), if K > 2.
Exercise 7 Show that T(n) = O(log n), where
T(n) = 1, if n = 1;
T(n) = T(n/2) + O(1), if n > 1.
Show that
Tb(k) = (O(k))^k,
where
Tb(k) = (2k − 4)! / (2^(k−2) (k − 2)!).
Hint: Use Stirling's approximation (Equation (2.1)) for the factorials in Tb(k).
Given two vertices v1 , v2 ∈ V , the distance between the two is given as the
maximum path length over all possible paths between v1 and v2 , i.e.,
(b)
S ← S \ {x}
Hint: Let
M(d, k), d ≤ k,
be the total number of vertices at depth d in ALL rooted trees with k nodes. Then
M(0, 1) = 1, M(1, 1) = 1,
and
M(d, k) = M(d − 1, k − 1) + k M(d, k − 1), for 0 ≤ d ≤ k;
M(d, k) = 0, otherwise.
1. M(0, k), k ≥ 1, is the total number of all possible trees (not necessarily distinct):
M(0, k) = k!
2. Average distance, µ(k), of a node from the root on the tree with k nodes:
µ(k) = ( Σ from d = 1 to k of d M(d, k) ) / ( k M(0, k) )
     = ( Σ from d = 1 to k of d M(d, k) ) / ( k · k! ).
46 Pattern Discovery in Bioinformatics: Theory & Algorithms
k\d   −1     0     1     2    3    4    5   6   7   µ(k)
1      0     1     1     0                          1
2      0     2     3     1    0                     1.25
3      0     6    11     6    1    0                1.44
4      0    24    50    35   10    1    0           1.60
5      0   120   274   225   85   15    1   0       1.74
6      0     -     -     -  735  175   21   1   0   -
7      0     -     -     -    -    -  322  28   1   -
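The recurrence in the hint is easy to tabulate; the following Python sketch (ours) regenerates the table entries and µ(k):

from functools import lru_cache

@lru_cache(maxsize=None)
def M(d: int, k: int) -> int:
    """Total number of vertices at depth d over ALL rooted trees on k nodes."""
    if d < 0 or d > k:
        return 0
    if k == 1:
        return 1                  # M(0, 1) = M(1, 1) = 1
    return M(d - 1, k - 1) + k * M(d, k - 1)

def mu(k: int) -> float:
    """Average depth of a node: sum_d d M(d, k) / (k * k!)."""
    return sum(d * M(d, k) for d in range(1, k + 1)) / (k * M(0, k))

print([M(d, 4) for d in range(5)])   # -> [24, 50, 35, 10, 1]
print(round(mu(5), 2))               # -> 1.74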
Chapter 3
Basic Statistics
3.1 Introduction
To keep the book self-contained, in this chapter we review some basic statistics. To navigate this (apparently) complex maze of ideas and terminologies, we follow a simple story-line.
Professor Marwin, who kept us busy in the last chapter with her algorithmic challenges, also tends a koi pond with four types of this exquisite fish: Asagi (A), Chagoi (C), Ginrin Showa (G) and Tancho Sanshoku (T). The fish have multiplied too tremendously for the professor to keep track of their exact number, but she claims that the four types have an equal representation in her large pond.
She is introduced to a scanner (a kind of fish net or trap) that catches no more than one koi at each trial, i.e., zero or one koi. The manufacturer sells the scanners in a turbo mode, as a k-scanner, where k (≥ 1) scanners operate simultaneously, yielding an ordered sequence of k results. The professor further asserts that each fish in the pond is equally healthy, agile and alert to avoid a scanner, thus having an equal chance of being trapped (or not) in the scanner.
We study the relevant important concepts centered around this koi pond scenario. We give a quick summary below of the questions that motivate the different concepts.
A probability space is specified by a triplet
(Ω, F, P)    (3.1)
where Ω is the sample space, the set of all possible outcomes (in a continuous setting, for instance, Ω = R^n); F is a collection of subsets of Ω, called events, such that for each outcome ω ∈ Ω the singleton {ω} ∈ F; and
P : F → R
is a probability measure satisfying the following (Kolmogorov) conditions:
1. For each E ∈ F,
P(E) ≥ 0.
2. For pairwise disjoint events E1, E2, . . . ∈ F,
P(∪i Ei) = Σi P(Ei).
3. P(Ω) = 1.
We leave it as an exercise for the reader (Exercise 12) to show that under these conditions, for each E ∈ F,
0 ≤ P(E) ≤ 1.    (3.3)
Bayes' rule. The Bayesian approach is one of the most commonly used methods in a wide variety of applications ranging from bioinformatics to computer vision. Roughly speaking, this framework exploits multiply occurring events in observed data sets by using the occurrence of one or more events to (statistically) guess the occurrence of the other events. Note that there can be no claim on an event being either the cause or the effect.
The simplicity of the Bayesian rule is very appealing and we discuss this below.
Joint probability is the probability of two events in conjunction. That is, it
is the probability of both events together. The joint probability of events E1
and E2 is written as
P (E1 ∩ E2 )
or just
P (E1 E2 ).
Going back to the foundations, Kolmogorov's axioms lead to the natural concept of conditional probability. For E1 with
P(E1) > 0,
the conditional probability of E2 given E1, written
P(E2|E1),
is defined as follows:
P(E2|E1) = P(E1 ∩ E2) / P(E1).
In other words, conditional probability is the probability of some event E2, given the occurrence of some other event E1.
In this context, the probability of an event say E1 is also called the marginal
probability. It is the probability of E1 , regardless of event E2 . The marginal
probability of E1 is written P (E1 ).
Bayes' theorem relates the conditional and marginal probability distributions of random variables as follows:

THEOREM 3.1
(Bayes' theorem) Given events E1 and E2 in the same probability space, with
P(E2) > 0,
P(E1|E2) = ( P(E2|E1) / P(E2) ) P(E1).    (3.4)
The proof falls out of the definitions and the result is often interpreted as:
Posterior = ( Likelihood / normalization factor ) × Prior.
We will pick up this thread of thought in a later chapter on maximum likeli-
hood approach to problems.
Two events E1 and E2 are mutually exclusive if
1. E1 ∩ E2 = ∅, and hence
2. P(E1 ∩ E2) = 0.
Mutually exclusive events are also called disjoint. It follows that if E1 and E2 are mutually exclusive, then the conditional probabilities are zero, i.e., P(E1|E2) = P(E2|E1) = 0. What can be said when
E1 ∩ E2 ≠ ∅ ?
Note, on the other hand, that if
E1 ∩ E2 = ∅
and
P(E1)P(E2) ≠ 0,
then the events are (very) dependent.
Mathematically speaking, two events E1 and E2 are independent if and only if the following holds:
P(E1 ∩ E2) = P(E1)P(E2).
Union of events. Using the Kolmogorov axioms one can deduce the fol-
lowing:
P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 ).
For a natural generalization to
P (E1 ∪ E2 ∪ . . . ∪ En ),
define the following quantities: Sl is the sum of the probabilities of the intersections of all possible l out of the n events. Thus
S1 = Σ over i of P(Ei),
S2 = Σ over i < j of P(Ei ∩ Ej),
S3 = Σ over i < j < k of P(Ei ∩ Ej ∩ Ek),
and so on.
THEOREM 3.2
(Inclusion-exclusion principle)
P(E1 ∪ E2 ∪ . . . ∪ En) = S1 − S2 + S3 − S4 + S5 − . . . + (−1)^(n+1) Sn
                        = Σ from i = 1 to n of (−1)^(i+1) Si.
Hence,
(−1)^k Sk + (−1)^(k+1) Sk+1 is ≥ 0 for k odd, and ≤ 0 for k even.    (3.5)
This implies that (the proof is left as an exercise for the reader):
P(E1 ∪ E2 ∪ . . . ∪ En) ≤ Σ from i = 1 to k of (−1)^(i+1) Si, for k odd;
P(E1 ∪ E2 ∪ . . . ∪ En) ≥ Σ from i = 1 to k of (−1)^(i+1) Si, for k even.    (3.6)
In particular (taking k = 1),
P(E1 ∪ E2 ∪ . . . ∪ En) ≤ S1.
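The principle is easy to verify numerically on a toy discrete space; a Python sketch (ours, with an invented example of three divisibility events on {1, . . . , 12}):

from itertools import combinations
from fractions import Fraction

omega = set(range(1, 13))                     # uniform sample space
E = [{n for n in omega if n % 2 == 0},        # multiples of 2
     {n for n in omega if n % 3 == 0},        # multiples of 3
     {n for n in omega if n % 4 == 0}]        # multiples of 4
P = lambda A: Fraction(len(A), len(omega))

union = P(E[0] | E[1] | E[2])
# S_l: sum of P over all l-wise intersections
S = [sum(P(set.intersection(*c)) for c in combinations(E, l))
     for l in (1, 2, 3)]
assert union == S[0] - S[1] + S[2]   # Theorem (3.2)
assert union <= S[0]                 # the k = 1 bound above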
In the discrete setting, the probability measure P can be specified through a function
MP : Ω → R≥0
given by
MP(ω) = P({ω}).    (3.8)
In other words, MP assigns a probability to each element ω ∈ Ω. The function MP is often called a probability mass function. It can be verified that if P satisfies Kolmogorov's conditions (Section (3.2.1.3)) then MP must satisfy the following (called the probability mass function conditions or probability axioms):
1. 0 ≤ MP(ω) ≤ 1, for each ω ∈ Ω, and,
2. Σ over ω ∈ Ω of MP(ω) = 1.
Moreover, every subset of Ω can be taken to be an event, i.e.,
F = 2^Ω.
Thus for a discrete setting the probability space is specified by the triplet:
(Ω, 2^Ω, MP)    (3.9)
2. The professor asserts that each type in her pond is equally likely to be caught in a scanner (we call this the uniform model). In particular, if the possible outcomes of a single scanner are x1, x2, . . . , xN with probabilities pr(xi), then
Σ from i = 1 to N of pr(xi) = 1.
Next, there is a need to show that the probability function P (as derived from MP) satisfies the probability measure conditions. This is left as an exercise for the reader (Exercise 15).
Back to the query. Let E denote the event that the outcome of the 2-scanner is homogeneous. We claim the following:
E = {- -, A-, -A, -C, C-, -G, G-, -T, T-, AA, CC, GG, TT}
Treating E as the union of singleton sets, and since all singleton intersections are empty, we get
P(E) = Σ over {uv} ∈ E of P({uv})       (probability measure cond (2))
     = Σ over {uv} ∈ E of MP(uv)        (probability mass function defn (3.8))
     = Σ over {uv} ∈ E of pr(u)pr(v)    (using Eqn (3.12))
     = 33/81.
Let Ē denote the event that the scan is not homogeneous. Then P(Ē) is given by:
P(Ē) = P(Ω \ E)        (since E ∪ Ē = Ω)
     = P(Ω) − P(E)     (probability measure cond (2), and, since E ∩ Ē = ∅)
     = 1 − P(E)        (since P(Ω) = 1)
     = 48/81
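The computation can be replayed by brute-force enumeration. The single-scanner probabilities are not repeated here, so the Python sketch below (ours) assumes pr(-) = 1/9 and pr = 2/9 for each of the four types, an assumption that reproduces the figures above:

from fractions import Fraction
from itertools import product

# Assumed single-scanner probabilities (illustrative values)
pr = {'-': Fraction(1, 9), 'A': Fraction(2, 9), 'C': Fraction(2, 9),
      'G': Fraction(2, 9), 'T': Fraction(2, 9)}

def homogeneous(u: str, v: str) -> bool:
    """At most one distinct (non-empty) type appears in the 2-scan."""
    return len({c for c in (u, v) if c != '-'}) <= 1

P_E = sum(pr[u] * pr[v] for u, v in product(pr, repeat=2)
          if homogeneous(u, v))
print(P_E, 1 - P_E)   # -> 11/27 and 16/27, i.e., 33/81 and 48/81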
A random variable X distributed according to P is written as
X ∼ P.4
Formally, a random variable3 is a function
X : Ω → R,
i.e., X maps Ω to real numbers. See Section (3.2.1.2) for the notion of measurability of functions. Often
P({ω ∈ Ω | X(ω) ≤ x0})
is abbreviated as
P(X ≤ x0).
Further, since random variables are simply functions, they can be manipulated as easily as functions. Thus, we have the following.

3 Recall that in Kolmogorov's axiomatic approach the event and its probability is the primary concept.
4 Usually P is specified by the distribution parameters mean µ and variance σ^2 and written as P(µ, σ).
If X1 and X2 are random variables on the same probability space, then so are the sum
X1 + X2
and the product
X1 X2.
3.2.6 Expectations
The mathematical expectation (or simply expectation) of X, denoted by E[X],5 is defined as follows:
E[X] = ∫ over Ω of X dP.    (3.14)
In the discrete setting (Ω, 2^Ω, MP), this reads
E[X] = Σ over ω ∈ Ω of X(ω) MP(ω).    (3.15)
The expected values of the powers of X are called the moments of X. The l-th moment of X is defined by
E[X^l] = ∫ over Ω of X^l dP.    (3.16)

5 In this chapter we have been denoting an event with E; expectation is also denoted with E[·]. The meaning should be clear from the context.
1. Inequality property: if X1 ≤ X2 (pointwise), then E[X1] ≤ E[X2].
2. Linearity: E[aX1 + bX2] = aE[X1] + bE[X2], for constants a, b.
3. Nonmultiplicative: in general, E[X1X2] ≠ E[X1]E[X2].
These results are straightforward to prove and follow from the properties of integration (or summations in the discrete scenario).
For each type Z, define a random variable (e.g., XC, XA : Ω → R) by
XZ(ω) = l,
where l is the number of occurrences of type Z in the outcome ω. For instance,
E[XA] = 18/81.
The variance of XC, V[XC], is the second moment about the mean, and is given by
V[XC] = E[(XC − E[XC])^2] = E[XC^2] − E[XC]^2.
For a continuous random variable X with density function f,
P(x ≤ X ≤ x + dx) = f(x)dx,
and the cumulative distribution function is
F(x) = P(X ≤ x).
The standard distributions used in this chapter are summarized below.
(a) Binomial: mass function (n choose k) p^k (1 − p)^(n−k), for k = 0, 1, . . . , n; mean np; variance np(1 − p).
(b) Poisson: mass function e^(−λ) λ^k / k!, for k ∈ N; mean λ; variance λ.
(c) Normal: probability density function (continuous) f(x) = (1/(σ√(2π))) exp(−(x − µ)^2 / (2σ^2)), for x ∈ R; mean µ; variance σ^2.
Note that the density function f and the cumulative distribution function F are related by
F(x) = ∫ from −∞ to x of f(t) dt.
The binomial and Poisson distribution are among the most well-known
discrete probability distributions. The normal or the Gaussian is the most
commonly used continuous distribution.
Binomial distribution is the discrete probability distribution that expresses the probability of the number of 1's in a sequence of n independent 0/1 experiments, each of which yields 1 with probability p. Thus the mass function
prbinomial(k; n, p) = (n choose k) p^k (1 − p)^(n−k)
gives the probability of seeing exactly k 1's for a fixed n and p. It can be verified that
Σ from k = 0 to n of prbinomial(k; n, p) = 1.
LEMMA 3.1
Let X1 and X2 be two independent random variables.
1. If X1 ∼ Binomial(n1, p) and X2 ∼ Binomial(n2, p), then
X = X1 + X2 ∼ Binomial(n1 + n2, p).
2. If X1 ∼ Poisson(λ1) and X2 ∼ Poisson(λ2), then
X = X1 + X2 ∼ Poisson(λ1 + λ2).

PROOF For X = X1 + X2, conditioning on the value of X1,
pr(X = k) = Σ from i = 0 to k of f1(i) f2(k − i),
where fi is the probability mass function for Xi, i = 1, 2. We now deal with the two distributions separately.
Binomial distribution:
pr(k) = Σ from i = 0 to k of (n1 choose i) p^i (1 − p)^(n1−i) (n2 choose k−i) p^(k−i) (1 − p)^(n2−k+i)
      = p^k (1 − p)^(n1+n2−k) Σ from i = 0 to k of (n1 choose i)(n2 choose k−i).
Since
Σ from i = 0 to k of (n1 choose i)(n2 choose k−i) = (n1+n2 choose k),
we have
pr(k) = (n1+n2 choose k) p^k (1 − p)^(n1+n2−k).
Hence
X ∼ Binomial(n1 + n2, p).
Poisson distribution:
pr(k) = Σ from i = 0 to k of ( e^(−λ1) λ1^i / i! )( e^(−λ2) λ2^(k−i) / (k − i)! )
      = e^(−(λ1+λ2)) Σ from i = 0 to k of λ1^i λ2^(k−i) / ( i! (k − i)! )
      = ( e^(−(λ1+λ2)) / k! ) Σ from i = 0 to k of (k choose i) λ1^i λ2^(k−i).
By the binomial theorem, the last sum is (λ1 + λ2)^k. Thus
pr(k) = e^(−(λ1+λ2)) (λ1 + λ2)^k / k!.
Hence
X ∼ Poisson(λ1 + λ2).
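Lemma (3.1) can also be checked empirically; a Python sketch (ours) using numpy, with arbitrary illustrative parameters:

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
n1, n2, p = 10, 15, 0.3

# Sum of independent Binomial(n1, p) and Binomial(n2, p) samples
x = rng.binomial(n1, p, N) + rng.binomial(n2, p, N)
print(x.mean(), (n1 + n2) * p)             # both ~ 7.5
print(x.var(), (n1 + n2) * p * (1 - p))    # both ~ 5.25

# Sum of independent Poisson samples
lam1, lam2 = 0.054, 0.018
y = rng.poisson(lam1, N) + rng.poisson(lam2, N)
print(y.mean(), y.var(), lam1 + lam2)      # all ~ 0.072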
LEMMA 3.2
If X1 ∼ Normal(µ1, σ1^2) and X2 ∼ Normal(µ2, σ2^2) are two independent random variables, then
X1 + X2 ∼ Normal(µ1 + µ2, σ1^2 + σ2^2).
(As an aside on the binomial mass function: consider the expansion of
(x + y)^n.
The mass function values are indeed the terms of this polynomial. Putting
x = p
and
y = 1 − p
in fact shows the values add up to 1.)
For large n and small p, Binomial(n, p) is well approximated by
Poisson(λ)
with λ = np. Moreover, for large n, the standardized variable
(X − np) / √(np(1 − p))
is approximately distributed as
Normal(0, 1).
The professor asserts that the pond is large enough for these huge scanners.
Now, we must work closely with the in-house scientists of the manufacturer
to model this problem appropriately. After a series of carefully controlled
experiments, they make the following observations for the ∞-scanners.
2. The numbers of C, G, T and A types caught in a scan follow Poisson distributions with parameters
λC = λG = .054 and λT = λA = .018,
respectively.
3. For a fixed i, the chance of having a nonzero outcome in the ith scanner of the ∞-scanner in multiple scans (trials) is zero.
We set up the probability space
(Ω, 2^Ω, MP)
as follows:
Ω = {(iA, iC, iG, iT) | 0 ≤ iA, iC, iG, iT}.
The occurrence of each type is independent of the others (see observation 3), and we model each as a Poisson distribution. Thus:
MP(iA, iC, iG, iT) = Π over z ∈ {A, C, G, T} of Poisson(iz; λz),    (3.17)
with
λC = λG = .054 and λT = λA = .018.
Since λA = λT and λC = λG, it is convenient to condense an outcome to the pair
(iAT, iCG),
where iAT = iA + iT and iCG = iC + iG, with the probability space
(Ω, 2^Ω, MP)
now given by
Ω = {(iAT, iCG) | 0 ≤ iAT, iCG}.
FIGURE 3.1: The event E ⊆ Ω (shaded region) in the space with axes AT and CG, in (a) the discrete and (b) the continuous model.
Here we also use the fact that the sum of two Poisson random variables (parameters λ1, λ2) is another Poisson random variable (parameter λ1 + λ2), by Lemma (3.1). Thus
MP : Ω → R
is given by
MP(iAT, iCG) = Poisson(iAT; λA + λT) Poisson(iCG; λC + λG).
The shaded region of Figure 3.1(a) corresponds to the event E with
iCG ≥ 3iAT
as shown. Then
P(E) = Σ over all (iAT, iCG) with iCG ≥ 3iAT of MP(iAT, iCG).
sample and passing them through an electric and/or magnetic field, thus
separating the ions of differing masses. The relative abundance is deduced by
measuring intensities of the ion flux.
The scanner provides the relative masses of the four types in a catch. The average mass of a type scanned has been provided by the manufacturer as
µA = µT = 15 and µC = µG = 36,
with a standard deviation of
σ = 5√2
for each type.
Going back to our running example and using the 'condensed' model, the mass of the C+G's in a scan is denoted by the random variable X which, using Lemma (3.2), follows a normal distribution:
X ∼ Normal(2µC, 2σ^2).
Let E denote the event that there are thrice as many C+G's as A+T's in a scan. Our interest is in the shaded portion of the Ω space shown in Figure 3.1(b); the shaded region corresponds to
iCG ≥ 3iAT
as shown. Then, integrating the joint density f of the two (condensed) masses over this region,
P(E) = ∫∫ over iCG ≥ 3iAT of f(iAT, iCG) diCG diAT
     = 0.7154.
THEOREM 3.3
(Markov's inequality) If X is a random variable and a > 0, then
P(|X| ≥ a) ≤ E[|X|] / a.

PROOF Recall that
X : Ω → R.
Partition Ω into Ω1 = {ω | |X(ω)| < a} and Ω2 = {ω | |X(ω)| ≥ a}. Thus,
E[|X|] ≥ ∫ over Ω2 of |X(ω)| dP,
since
∫ over Ω1 of |X(ω)| dP ≥ 0.
But,
∫ over Ω2 of |X(ω)| dP ≥ a ∫ over Ω2 of dP,
since |X(ω)| ≥ a for all ω ∈ Ω2, and
∫ over Ω2 of dP = P(|X| ≥ a).
Hence,
E[|X|] ≥ a P(|X| ≥ a).
THEOREM 3.4
(Chebyshev's inequality theorem) If X is a random variable with mean µ and (finite) variance σ^2, then for any real k > 0,
P(|X − µ| ≥ kσ) ≤ 1/k^2.
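Chebyshev's bound is loose but universal, as a small simulation shows; a Python sketch (ours), using an exponential distribution purely as an example:

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, variance 1
mu, sigma = 1.0, 1.0
for k in (2, 3, 4):
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    print(k, empirical, "<=", 1 / k**2)          # the bound always holds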
The mode of S is the element xi that occurs the most number of times in the collection. Sorting the elements of S as
xi1 ≤ xi2 ≤ . . . ≤ xin,
the median is the middle element of this sorted order. Finally,
xmin = min over i of xi and xmax = max over i of xi.
1. The mean µn-sample of the sampling distribution of the sample mean equals the population mean µpopulation, while its standard deviation σn-sample (the standard error) shrinks with the sample size n as σpopulation/√n. Thus if we need to halve the standard error, the sample size must be quadrupled.
2. The sampling distribution of the sample mean is in fact a normal distribution
Normal(µn-sample, σn-sample)
(assuming the original population is 'reasonably behaved' and the sample size n is sufficiently large). A weaker version of this is stated in the following theorem.
THEOREM 3.5
(Standardization theorem) If the independent random variables
Xi ∼ Normal(µ, σ^2),
1 ≤ i ≤ n, and
X̄ = (1/n) Σ from i = 1 to n of Xi,
then
Z = (X̄ − µ) / (σ/√n) ∼ Normal(0, 1).
THEOREM 3.6
(Sample mean & variance theorem) If
X1, X2, X3, . . . , Xn
are independent, identically distributed random variables with mean µ and variance σ^2, and X̄n is their sample mean, then
E[X̄n] = µ and V[X̄n] = σ^2/n.
PROOF By linearity of expectations, E[X̄n] = (1/n) Σi E[Xi] = µ. Next,
E[X̄n^2] = E[ ( Σi Xi / n )^2 ]
         = (1/n^2) E[ Σi Xi^2 + 2 Σ over i < j of Xi Xj ]
         = (1/n^2) ( Σi E[Xi^2] + 2 Σ over i < j of E[Xi Xj] ).
The last step above is due to linearity of expectations. The random variables are independent, thus for each i ≠ j,
E[Xi Xj] = E[Xi] E[Xj] = µ^2.
Further, for each i,
E[Xi^2] = σ^2 + µ^2.
Hence we have,
E[X̄n^2] = (1/n^2) ( n(σ^2 + µ^2) + 2 ((n^2 − n)/2) µ^2 )
         = σ^2/n + µ^2.
Recall that
V[X̄n] = E[X̄n^2] − E[X̄n]^2.
Thus,
V[X̄n] = σ^2/n + µ^2 − µ^2 = σ^2/n.
In the last theorem, we saw that the expectation of the sample mean is µ. The Law of Large Numbers gives a stronger result regarding the distribution of the random variable X̄n: it says that as n increases, the distribution concentrates around µ. The formal statement and proof are given below.

THEOREM 3.7
(Law of large numbers) If
X1, X2, X3, . . .
are independent, identically distributed random variables with mean µ and (finite) variance σ^2, then for every ǫ > 0,
lim as n → ∞ of P(|X̄n − µ| < ǫ) = 1,
where
X̄n = (X1 + X2 + . . . + Xn) / n
is the sample average.

PROOF By Theorem (3.6),
E[X̄n] = µ
and the standard deviation of X̄n is
σ/√n.
For any k > 0, Chebyshev's inequality on the random variable X̄n gives
P( |X̄n − µ| ≥ k σ/√n ) ≤ 1/k^2.
Thus letting
ǫ = k σ/√n,
we get
P( |X̄n − µ| ≥ ǫ ) ≤ σ^2 / (ǫ^2 n).
In other words, since P is a probability distribution,
P( |X̄n − µ| < ǫ ) ≥ 1 − σ^2 / (ǫ^2 n).
Since
lim as n → ∞ of σ^2 / (ǫ^2 n) = 0,
we conclude
lim as n → ∞ of P(|X̄n − µ| < ǫ) = 1.
Using the above and Taylor's formula, we have that for all small enough t > 0,
ΦX(t) = ΦX(0) + (Φ′X(0)/1!) t + (Φ″X(0)/2!) t^2 + . . .
      = 1 + i E[X] t + i^2 (E[X^2]/2!) t^2 + o(t^2).    (3.19)
4. If
X ∼ Normal(0, 1),
implying that its density is
fX(x) = (1/√(2π)) e^(−x^2/2),
then one can show using complex integration that
ΦX(t) = e^(−t^2/2).
THEOREM 3.8
(Central limit theorem) If
X1, X2, X3, . . .
are independent, identically distributed random variables with mean µ and (finite) variance σ^2, let
Sn = X1 + . . . + Xn and Zn = (Sn − nµ) / (σ√n).
Then the distribution of Zn converges to Normal(0, 1) as n approaches ∞.

PROOF Let
Yi = (Xi − µ) / σ.
Then Yi is a random variable with mean 0 and standard deviation 1 and
Zn = (1/√n) Σ from i = 1 to n of Yi.
Since E[Yi] = 0 and E[Yi^2] = 1, Equation (3.19) gives
ΦYi(t) = 1 − t^2/2 + o(t^2).    (3.20)
It follows from the Equations (3.19) and (3.20) that
ΦZn(t) = ( 1 − t^2/(2n) + γ(t, n)/n )^n,
where
lim as n → ∞ of γ(t, n) = 0.
Hence
lim as n → ∞ of ΦZn(t) = lim as n → ∞ of ( 1 − t^2/(2n) + γ(t, n)/n )^n = e^(−t^2/2).
Thus the characteristic function of Zn approaches e^(−t^2/2) as n approaches ∞. By Property 4, we know that
e^(−t^2/2)
is the characteristic function of a normally distributed random variable with mean 0 and standard deviation 1. This implies (with some more work, which we omit) that the probability distribution of Zn converges to
Normal(0, 1)
as n approaches ∞.
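The convergence asserted by the theorem is easy to visualize numerically; a Python sketch (ours), using uniform summands as an arbitrary example:

import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 100_000
# X_i uniform on [0, 1]: mu = 1/2, sigma^2 = 1/12
samples = rng.random((trials, n))
z = (samples.sum(axis=1) - n * 0.5) / np.sqrt(n / 12)
print(z.mean(), z.std())    # ~ 0 and ~ 1
print(np.mean(z <= 1.0))    # ~ 0.8413, the Normal(0,1) CDF at 1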
2. The probability space
(Ω, 2^Ω, MP)
is defined as follows:
Ω = {0, 1, 2, . . . , 20}.
By our null hypothesis, MP is defined by the distribution
Binomial(20, 1/2).
Thus,
MP(ω ∈ Ω) = (20 choose ω) p^ω (1 − p)^(20−ω), with p = 1/2.
3. Outcome of the experiment: Let X denote the number of C+G types in a scan. Suppose in our experiment of 20 scans we see 16 C+G types. Then
P(X ≥ 16) = Σ from k = 16 to 20 of MP(k)
          = Σ from k = 16 to 20 of (20 choose k) (1/2)^20
          = 0.0573
In other words,
p-value = 0.0573
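Such one-sided binomial tail probabilities are directly computable; a Python sketch (ours):

from math import comb

def binomial_tail(n: int, x: int, p: float = 0.5) -> float:
    """p-value P(X >= x) under the null X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(x, n + 1))

# e.g., tail probabilities for 20 scans under the null Binomial(20, 1/2):
print(binomial_tail(20, 16))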
3.4 Summary
The chapter was a whirlwind tour of basic statistics and probability. It is worth realizing that this field has taken at least a century to mature, and what it has to offer is undoubtedly useful as well as beautiful. A correct modeling of real biological data will require understanding not only of the biology but also of the probabilistic and statistical methods.
The chapter has been quite substantial, in keeping with the amount of work in the bioinformatics literature that uses statistical and probabilistic ideas. Information theory is an equally interesting field that deserves some discussion here. Since we have already introduced the reader to random variables and expectations, a few basic ideas in information theory are explored through Exercise 21.
3.5 Exercises
Exercise 12 Consider the Kolmogorov’s axioms of Section (3.2.1.3). Show
that under these conditions, for each E ∈ F,
0 ≤ P (E) ≤ 1.
MP(uv) = pr(u)pr(v).
Hint:
1. Show that
Σ from k = −∞ to ∞ of f(k; λ) = 1.
We assume that the functions f1 and f2 are defined on the whole real
line, extending them by 0 outside their domains of definition otherwise.
For the intrepid reader: can you prove the following?
(a)
f1 ∗ f2 = f2 ∗ f1 .
1. Let X be a discrete random variable taking value k with probability
pr(X = k).
(b) Show that the entropy H(X) is maximized under the uniform distribution, i.e., when each outcome is equally likely.
2. Let X1 and X2 be two discrete random variables, and let
pr(ki)
and
pr(k1, k2)
denote the marginal and joint probability mass functions, respectively. Show that
H(X|Y) = H(X, Y) − H(Y).
Show that
(1) I(X1; X2) = H(X1) − H(X1|X2), and
(2) I(X1; X2) = I(X2; X1).
Hint: 1. & 2. Let n = 2. The value of the min and max functions, respectively, is constant along the dashed lines shown below. We need to integrate these two functions over the unit area. Note that for the min function, in the lower triangle min(x, y) = y and in the upper triangle min(x, y) = x. The reverse is true for the max function.
[Figure: level lines of min(x, y) and max(x, y) over the unit square.]
3. Use 1. and 2. and the expected path depth (see Exercise 11).
Comments
My experience with students (and even some practicing scientists) has been that there is a general haziness when it comes to interpreting or understanding statistical computations. I prefer Kolmogorov's axiomatic approach to probability since it is very clean and elegant. In this framework, it is easy to formulate the questions, identify the assumptions and, most importantly, convince yourself and others about the import and the correctness of your models.
A small section of this chapter belabors the proofs of some fundamental theorems in probability and statistics. This is just to show that most of the results we need to use in bioinformatics are fairly simple, and the basis of these foundational results can be understood by most who have a nodding acquaintance with elementary calculus.
In my mind the most intriguing theorem in this area is the Central Limit Theorem (and also the Large Number Theorem). While the latter, quite surprisingly, has an utterly simple proof, the former requires some familiarity with function convergence to appreciate the underlying ideas. In a sense it is a relief to realize that the proof of the most used theorem in bioinformatics is not too difficult.
Chapter 4
What Are Patterns?
4.1 Introduction
Patterns haunt the mind of a seeker. And, when the seeker is equipped
with a computer with boundless capabilities, limitless memory, and easily
accessible data repositories to dig through, there is no underestimating the
care that should be undertaken in defining such a task and interpreting the
results. For instance, the human genome alone is 3,000,000,000 nucleotides
long! A sequence of 3 billion characters defined on an alphabet size of four
(A, C, G, T) offers endless possibilities for useful as well as possibly useless
nuggets of information.1
The genomes are so large in terms of nucleotide lengths that a database² reports the sizes of the genomes by their weight in picograms (1 picogram = one trillionth of a gram). It is also important to note that genome size does not correlate with organismal complexity. For instance, many plant species and some single-celled protists have genomes much larger than that of humans. Also, genome sizes (called C-values) vary enormously among species, and the relationship of this variation to the number of genes in the species is unclear.
The presence of noncoding regions settles the question to a certain extent
in eukaryotes. The human genome, for example, comprises only about 1.5%
protein-coding genes, with the remaining being various types of noncoding
DNA, especially transposable elements.3
Hence our tools of analysis or discovery must be very carefully designed. Here we discuss models that are mathematically and statistically 'sensible' and, hopefully, these can be transposed to the biologically 'meaningful'.

1 Also, Ramsey theory states that even random sequences, when sufficiently large, will contain some sort of pattern (or order).
2 Gregory, T.R. (2007). Animal Genome Size Database. http://www.genomesize.com
3 As reported by the International Human Genome Sequencing Consortium, 2001.
A pattern p is characterized by two things:
1. its specification (i.e., what form p takes), and
2. its occurrence (i.e., what condition must be satisfied to say that p occurs at position i in the input).
Thus a pattern can be specified in many and varied ways, but its location list, mercifully, is always a set. This gives a handle to extract the common properties of patterns, independent of their specific form.
4.3.1 Operators on p
What operators should (or can) be defined on patterns? Can we identify
them even before assigning a specific definition to a pattern?
Here we resort to the duality of the patterns. So, what are the operations defined on location lists? Since sets have union and intersection operations defined on them, it is meaningful to define two operators as follows.
The plus operator (⊕):
p = p1 ⊕ p2 ⇔ Lp = Lp1 ∩ Lp2 .
The second operator (⊗) is defined analogously:
p = p1 ⊗ p2 ⇔ Lp = Lp1 ∪ Lp2.
Further, the following properties hold which again are a direct consequence
of the corresponding set operation.
p1 ⊕ p2 = p2 ⊕ p1
p1 ⊗ p2 = p2 ⊗ p1
(p1 ⊕ p2 ) ⊕ p3 = p1 ⊕ (p2 ⊕ p3 )
(p1 ⊗ p2 ) ⊗ p3 = p1 ⊗ (p2 ⊗ p3 )
(p1 ⊗ p2 ) ⊕ p3 = (p1 ⊕ p3 ) ⊗ (p2 ⊕ p3 )
(p1 ⊕ p2 ) ⊗ p3 = (p1 ⊗ p3 ) ⊕ (p2 ⊗ p3 )
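The operators are natural to express directly on location lists; a Python sketch (ours, with illustrative names, for solid string patterns):

def loc(p: str, s: str) -> frozenset:
    """Location list L_p of pattern p in input string s (0-indexed)."""
    return frozenset(i for i in range(len(s) - len(p) + 1)
                     if s[i:i + len(p)] == p)

def plus(L1: frozenset, L2: frozenset) -> frozenset:
    """p = p1 (+) p2  <=>  L_p is the intersection of L_p1 and L_p2."""
    return L1 & L2

def cross(L1: frozenset, L2: frozenset) -> frozenset:
    """p = p1 (x) p2  <=>  L_p is the union of L_p1 and L_p2."""
    return L1 | L2

Seen this way, the commutative, associative and distributive properties above are just the corresponding set identities for intersection and union.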
If
p1 ⊗ p2 ⊗ . . . ⊗ pl = p,
then p is said to be redundant with respect to p1, p2, . . . , pl, and this is written as
{p1, p2, . . . , pl} ֒→ p.
For example, consider the pattern
p = A T T G C
on an input string I,5 with
Lp = {3, 10, 18}.

5 The reader may be familiar with the use of maximality, which is widely used to remove redundancies.
Intuitively, what are the nonmaximal patterns? Assume, for convenience, that a string pattern must have at least two elements. Then all the nonmaximal patterns p1, p2, . . . , p9 of p = ATTGC are shown below:
p1 = ATTG, p2 = ATT, p3 = AT,
p4 = TTGC, p5 = TTG, p6 = TT,
p7 = TGC, p8 = TG,
p9 = GC.
Note that if the input were I′, then the occurrences of p8 are as shown below:
I′ = G A A T T G C G G A T T G C C A C A T T G C C T G
The redundancy notation extends to maximality: writing
p1 ֒→ p
when p1 is subsumed by p, observe that if two patterns p1 and p2 are such that
Lp1 = Lp2,
then one of them must be nonmaximal. This agrees with the intuitive notion that if multiple motifs occur in the same positions, then the largest (or most informative) is the maximal motif.
Note that maximality is always in terms of an input I. There is no notion of maximality on an isolated string p.
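Nonmaximality is a mechanical check on location lists; a Python sketch (ours):

def loc(p: str, s: str) -> frozenset:
    return frozenset(i for i in range(len(s) - len(p) + 1)
                     if s[i:i + len(p)] == p)

def nonmaximal(p1: str, p: str, s: str) -> bool:
    """p1 is nonmaximal w.r.t. p in s if p1 is shorter and its occurrences
    are exactly the occurrences of p shifted by a fixed offset d."""
    L1, L = loc(p1, s), loc(p, s)
    if len(p1) >= len(p) or len(L1) != len(L):
        return False
    return any(L1 == frozenset(i + d for i in L)
               for d in range(len(p) - len(p1) + 1))

I = "GAATTGCGGATTGCCACATTGCCTG"    # the input I' above, without spaces
print(nonmaximal("TTG", "ATTGC", I))   # True: same occurrence list, shifted
print(nonmaximal("TG", "ATTGC", I))    # False: TG has an extra occurrence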
Suppose
{p1, p2, . . . , pl} ֒→ p,
with
p11, p12, . . . , p1l1,
p21, p22, . . . , p2l2,
. . .
pl1, pl2, . . . , plll ∈ P(I).
That is, p is redundant w.r.t.
p1, p2, . . . , pl,
and each pi in turn is redundant w.r.t. the set Pi = {pi1, pi2, . . . , pili}; then p is redundant with respect to the union of sets, i.e.,
∪i Pi ֒→ p.
THEOREM 4.1
Let P(I) be the set of all patterns that occur in a given input I. Then the collection of irredundant patterns of P(I) is unique.

COROLLARY 4.1
Let P(I) be the set of all patterns that occur in a given input I. Then the collection of maximal patterns of P(I) is unique.
Thus any pattern p, even one excluded from the basis, can be written as
p1 ⊗ p2 ⊗ . . . ⊗ pl = p,
for some
p1, p2, . . . , pl ∈ P(I)
retained in the basis; the basis is defined as the smallest such generating collection. Note that
Pbasis(I) ⊆ Pmaximal(I) ⊆ P(I).
When, for each i,
p ֒→ pi,
then we write
p ֒→ p1, p2, . . . , pl.
But each location in LBC.E is one position away from a position in LABC.E:
LBC.E = { (i, j + δ) | (i, j) ∈ LABC.E },
where δ = 1. The same holds for LBC. Intuitively, the lists should be the same since they are capturing the same common segments in the input with a phase shift. Thus, the location lists are 'fuzzily' equal. Also,
LA..DE = LAB.DE ∪ LA.CDE,
LA.C.E = LA.CDE ∪ LABC.E,
LAB..E = LAB.DE ∪ LABC.E,
LA...E = LA.CDE ∪ LAB.DE ∪ LABC.E.
What are the redundancies? The redundancies are as shown below (redundancy with restriction l = 1 gives the maximal elements):
Nonmaximal patterns:
AB..E ֒→ AB, B..E
A..DE ֒→ DE, A..D
A.C.E ֒→ A.C, C.E
ABC.E ֒→ BC, ABC, BC.E
A.CDE ֒→ CD, CDE, A.CD
AB.DE ֒→ B.D, AB.D, B.DE
Redundant w.r.t. the maximal patterns:
{AB.DE, A.CDE} ֒→ A..DE
{A.CDE, ABC.E} ֒→ A.C.E
{AB.DE, ABC.E} ֒→ AB..E
{A.CDE, AB.DE, ABC.E} ֒→ A...E
Thus, for the input
s1 = d e a b c,
s2 = c a b e d,
we have
P (I) = {{a, b}, {a, b, c}, {e, d}, {a, b, c, d, e}} .
Each pattern occurs in the same two segments of the input. Thus, by using
the sizes of the patterns, it is possible to guess if one pattern is contained in
the other (in this example the occurrence is without gaps). Hence
L{a,b} = {(i, j + δ(i, j)) | (i, j) ∈ L{a,b,c,d,e} },
where
δ(1, 1) = 2 and δ(2, 1) = 1.
Thus the value of the δ’s is clear from the context. Using similar arguments,
we get the following:
Thus
Pmaximal (I) = Pbasis (I) = {{a, b, c, d, e}}.
|Lp | ≥ k.
Thus the pattern must occur at least k times in the input to meet this quorum
constraint.
L = {L | L is a collection of positions i of I} .
|L′ | / |L| < ǫ.
The specification of a pattern should not be so lax that any arbitrary collec-
tion of segments in the input qualifies to be a pattern. In other words, the
probability that a list of positions corresponds to a pattern p should be fairly
low, i.e.,
pr(L ∈ L′ ) < ǫ.
(1b) C.TC (wild card): appears in
     s1 = A C T T C G
     s2 = C C G T C
     s3 = C T T C C G
(1c) C T T-2,3-G: appears with a gap of 2 or 3 wild cards in
     s1 = A C T T C G
     s2 = C C G T C
     s3 = C T T C C G
(2a) {g1 , g2 , g3 , g4 } (permutation pattern): appears together in any order in
     s1 = g2 g4 g1 g3 g6
     s2 = g7 g9 g1 g2 g3 g4
     s3 = g8 g3 g1 g4 g2
(2b) {g1 , g2 , g3 }: appears together in any order, with at most 1 gap, in
     s1 = g2 g5 g1 g6 g3
     s2 = g4 g9 g1 g2 g3 g7
     s3 = g6 g3 g1 g8 g2
(2c) {g1 , g2 , g4 }: appears together in any order, in a fixed window of size 4, in
     s1 = g2 g5 g1 g4 g6
     s2 = g4 g9 g1 g2 g3 g4
     s3 = g6 g3 g1 g4 g2
(2d) {m1 (2), m2 }: appears together in any order, but m2 appears 2 times, in
     s1 = m2 m1 m2 m5 m6
     s2 = m4 m9 m1 m2 m2 m4
(3b) [figure: a cluster of nodes 1, 2, 3, 4 between S and E] in cluster,
     1 precedes 2 (with gap); appears in
     q1 = 7 1 2 6 3 4
     q2 = 3 7 1 4 2 8
     q3 = 9 4 1 3 8 2
(3c) 2 → 3 (sequence pattern): 2 precedes 3 at each occurrence in
     q1 = 1 7 2 6 3 4
     q2 = 2 3 1 4 2 7
     q3 = 9 4 1 2 8 3
isomorphic
to a subgraph
4.8 Exercises
Exercise 23 (Nonuniqueness) The chapter discussed a scheme, using re-
dundancy, to trim P , the set of all patterns in the data to produce a unique
set, say P ′ .
Construct a specification of a pattern and an elimination scheme where the
resulting P ′ is not unique.
Hint: Let the pattern p be an interval (l, u) on reals. If two intervals overlap
and at least half of one interval is in the overlap region, then the smaller
interval is eliminated from the set of intervals P . Concrete example:
Σ \ Π(m).
σ5 σ4 σ1 σ6 σ2 σ4 σ5 σ3 σ2 .
m1 , m2 , . . . , mr ,
Π(mi ) ∩ Π(mj ) = ∅.
p = p1 ⊕ p2 ⇔ Lp = Lp1 ∩ Lp2 ,
p = p1 ⊗ p2 ⇔ Lp = Lp1 ∪ Lp2 ,
1. p1 ⊕ p2 = p2 ⊕ p1 .
2. p1 ⊗ p2 = p2 ⊗ p1 .
3. (p1 ⊕ p2 ) ⊕ p3 = p1 ⊕ (p2 ⊕ p3 ).
4. (p1 ⊗ p2 ) ⊗ p3 = p1 ⊗ (p2 ⊗ p3 ).
5. (p1 ⊗ p2 ) ⊕ p3 = (p1 ⊕ p3 ) ⊗ (p2 ⊕ p3 ).
6. (p1 ⊕ p2 ) ⊗ p3 = (p1 ⊗ p3 ) ⊕ (p2 ⊗ p3 ).
2. Can the pattern also be defined on sets as characters? What should the
constraints be?
How many such multiple sequences result in general? Is there any escape from
this explosion?
2. It is better to allow for multiple sets in patterns, as long as each multi-set
S in the pattern satisfies
S ⊆ Si′ ,
where Si′ is the multi-set in the corresponding position of its ith occurrence
in the input.
on the same (or similar) pattern forms give rise to different pattern specifi-
cations. Usually the more flexible a pattern, the more likely it is to pick up
false-positives. However, biological reality may demand such a flexibility.
Discuss which of the patterns are more nontrivial than the others in each
of the groups below. Why?
l = j2 − j1 + 1.
0 < α ≤ 1,
α (j2 − j1 + 1).
Given I and some α, the task is to find a minimum number of block patterns
that partition I into blocks. In other words, every column, j, must belong to
exactly one block
[Figure: four alternative block partitions of the same 5 × 9 binary input
matrix (columns indexed 1-9). Each partition splits the columns into blocks
b1 , b2 , . . . , and each block bi is summarized by a block pattern pbi .]
p               # of occurrences, k   length property
pattern         ≥ 2                   maximal     If p′ is a substring of p,
                                                  then p′ is also a pattern
unique          = 1                   minimal     If p is a substring of p′ ,
                                                  and p′ occurs in the input,
                                                  then p′ is also unique
forbidden       = 0                   minimal     If p is a substring of p′ ,
(anti-pattern)                                    then p′ is also forbidden
Hint: What is the dramatic shift in paradigm as k changes from 0 to 1 to
≥ 2, and why?
Comments
While we get down and dirty in the rest of the chapters, where the focus
is on a specific class of patterns, in this chapter we indulge in the poetry of
abstraction. One of the nontrivial ideas in the chapter is that of redundancy
of patterns, which I introduced in [PRF+ 00], by taking a frequentist view
of patterns in strings, rather than the usual combinatorial view. Also, a
nonobvious consequence of this abstraction is the identification of maximality
in permutations as recursive substructures (PQ trees; see Chapter 10).
As an aside, out of curiosity, we take a step back and pick a few definitions of
the word ‘pattern’ from the dictionary:
1. natural or chance marking, configuration, or design: patterns of frost
markings.
2. a combination of qualities, acts, tendencies, etc., forming a consistent
or characteristic arrangement: the behavior patterns of ambitious scien-
tists.
3. anything fashioned or designed to serve as a model or guide for some-
thing to be made: a paper pattern for a dress.
In fact none of these, that I picked at random, actually suggest repetitiveness.
The etymology of the word is equally interesting, confirming a Latin origin,
and I quote from Douglas Harper’s online etymology dictionary:
1324, ‘the original proposed to imitation; the archetype; that which is
to be copied; an exemplar’ [Johnson], from O.Fr. patron, from M.L.
patronus (see patron). Extended sense of ‘decorative design’ first recorded
1582, from earlier sense of a ‘patron’ as a model to be imitated. The
difference in form and sense between patron and pattern wasn’t firm
till 1700s. Meaning ‘model or design in dressmaking’ (especially one of
paper) is first recorded 1792, in Jane Austen. Verb phrase pattern after
‘take as a model’ is from 1878.
Part II
5.1 Introduction
In 1950, when it appeared that the capabilities of computing machines were
boundless, Alan Turing proposed a litmus test of sorts to evaluate a machine’s
intelligence by measuring its capability to perform human-like conversation.
The test with a binary PASS/FAIL result, called the Turing Test, is described
as follows: A human judge engages in a conversation, via a text-only channel,
with two parties, one of which is a machine and the other a human. If the
judge is unable to distinguish the machine from the human, then the machine
passes the Turing Test.
In the same spirit, can we produce a stream or string of nucleotides ‘in-
distinguishable’ from a human DNA fragment (with the judge of this test
possibly being a bacterium or a virus)?
In this chapter we discuss the problem of modeling strings: DNA, RNA
or protein sequences. We use basic statistical ideas without worrying about
structures or functions that these biological sequences may imply. Such mod-
eling is not simply for amusing a bacterium or a virus but for producing
in-silico sample fragments for various studies.1
1 A quick update on the capabilities of computing machines: More than half a century later,
TTACGTTACGTTACGTTACG
repeat, where n varies across samples, providing different alleles. This repeat
is in fact very frequent in the human genome.
A short tandem repeat (STR) in DNA is yet another class of polymorphisms:
the repeating units are 2-10 base pairs in length and the repeated sequences
are directly adjacent to each other. For example
(C A T G) n
is an STR. These are usually seen in the noncoding intron region, hence ‘junk’
DNA. The A C repeat is seen on the Y chromosome.
Short interspersed nuclear elements (SINEs) and Long Interspersed Ele-
ments (LINEs) are present in great numbers in many eukaryote genomes.
They are repeated and dispersed throughout the genome sequences. They
constitute more than 20% of the genome of humans and other mammals. The
most famous SINE family is that of the Alu repeats, typical of the human genome.
SINEs are very useful as markers for phylogenetic analysis whereas STRs and
VNTRs are useful in forensic analysis.
Moral of the story: Although this section discussed human DNA, re-
peats are known to exist in nonhuman DNA including that of bacteria and
archaea. Thus DNA in a sense is highly nonuniform and it is perhaps best
to model segments of the DNA separately, rather than with a ‘single-size-
fits-all’ model.
2 Optical mapping is a single molecule method and the reader is directed to [SCH+ 97, Par98,
CJI+ 98, Par99, PG99] for an introduction to the methods and the associated computational
problems.
3 A statistician’s love for alliteration is borne out by the use of terms such as iid and MCMC
∑_{j=1}^{N} praj = 1.
let
r0 = 0,
rj = rj−1 + 1000 praj , for each 0 < j ≤ N .
rj−1 ≤ r < rj .
We leave the proof of this statement as an exercise for the reader (Exercise 35).
Having set the stage, a random permutation is the outcome of a random trial
of the probability space defined above.
r0 = 0,
pr(σj ∈ Σk ) = 1/|Σk |,
rj = rj−1 + 1000 prσ .
Wrapping up. The second model provides a way of constructing (or sim-
ulating) a random permutation, which has the precise property of the first
model.
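The constructive model lends itself to a short simulation. The sketch below assumes the uniform case pr(σj ∈ Σk ) = 1/|Σk | discussed above; the function name and the use of 1000 as the sampling range mirror the thresholds rj:

```python
import random

def random_permutation(symbols):
    # Constructive model: draw r uniformly in [0, 1000) and select the
    # symbol a_j whose interval [r_{j-1}, r_j) contains r, where
    # r_j = r_{j-1} + 1000 * pr_{a_j}; here pr is uniform over the
    # symbols not yet placed.
    remaining = list(symbols)
    out = []
    while remaining:
        thresholds, r_j = [], 0.0
        for _ in remaining:
            r_j += 1000.0 / len(remaining)
            thresholds.append(r_j)
        r = 1000.0 * random.random()          # r in [0, 1000)
        j = next((i for i, t in enumerate(thresholds) if r < t),
                 len(remaining) - 1)
        out.append(remaining.pop(j))
    return out

print(random_permutation("abcde"))
```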
Notice that
∑_{s∈Ω} MP (s) = 1.
Having set the stage, a random string is the outcome of a random trial of the
probability space defined above.
and
∑_i prai = 1.
is a random string. This construction has already been discussed in the earlier
part of this section.
Wrapping up. The second model provides a way of constructing (or simu-
lating) a random string, which has the precise property of the first model.
In a sense the random string is the simplest model and other more sophis-
ticated schemes, that are possibly better models of biological sequences, are
discussed in the following sections.
X1 , X2 , X3 , . . .
where each random variable is a generalized Bernoulli trial (see Section 3.2.4),
i.e., it takes one possible value from a finite set of states S, with outcome
X1 , X2 , X3 , . . .
The possible values of Xi form a countable set S, called the state space
of the chain. We will concern ourselves only with finite S (often written as
|S| < ∞). Also, by Bayes’ rule (Equation (3.4)), we get:
[Figure: (a) a finite state machine on the states C and G, with self-transition
probability 0.8 and cross-transition probability 0.2; (b) the corresponding
transition probability matrix
        C    G
  C   0.8  0.2
  G   0.2  0.8 ]
where
2. P is a stochastic matrix.
C G G G C C G G G C.
In fact the process can generate all possible strings of C and G. Then, how
is this different from another Markov process with the same state space S,
capable of generating all possible strings of C and G, but with a different
transition matrix P?
So a natural question arises from this:
S = {A,T,C,G},
and the following stochastic matrix, which is the transition probability matrix
of a Markov process with
         A     T     C     G
    A   0.3   0.3   0.4   0
P = T   0     0.4   0     0.6          (5.5)
    C   0.25  0.25  0.25  0.25
    G   0     0.5   0     0.5
It turns out that there is no real π (i.e., all entries of the vector are real)
satisfying
πP = π
for this P. In matrix theory, we say that P is reducible if by a permutation of
rows (and corresponding columns) it can be reduced to the form
P′ = [ P11  P12 ]
     [  0   P22 ] ,
where P11 and P22 are square matrices. If this reduction is not possible, then
P is called irreducible. The following theorem gives the condition(s) under
which a matrix has a real solution.
THEOREM 5.1
(Generalized Perron-Frobenius theorem) A stochastic irreducible matrix
P has a real solution (i.e., values of π are real) to Equation (5.4).
The proof of this theorem is outside the scope of this book and we omit it.
It turns out that P of Equation (5.5) is not irreducible, since the following
holds:
P′ = [ P11  P12 ]  =       A     C     G     T
     [  0   P22 ]     A   0.3   0.4   0.3   0
                      C   0.25  0.25  0.25  0.25
                      G   0     0     0.5   0.5
                      T   0     0     0.4   0.6
Given a matrix, is there a simple algorithm to check if a matrix is irreducible?
We resort to algebraic graph theory for a simple algorithm to this problem.
We have already seen that P is the edge weight matrix of a directed graph
with S as the vertex set (also called the FSM earlier in the discussion).
A directed graph is strongly connected if there is a directed path from xi to
xj for each pair xi , xj ∈ S. See Figure 5.2 for an example. The following
theorem gives the condition under which a matrix is irreducible.
THEOREM 5.2
(Irreducible matrix theorem) The adjacency matrix of a strongly con-
nected graph is irreducible.
Recall that Depth First Search (DFS) is an
O(|V | + |E|)
recursive algorithm to traverse a graph. It turns out that we can check if the
directed graph is strongly connected by doing the following.
1. Fix a vertex v ′ ∈ V .
2. Perform a DFS traversal of G starting at v ′ ; if some vertex is not
reached, the graph is not strongly connected.
3. Transpose the graph, i.e., reverse the direction of every edge in the graph
to obtain GT .
4. Perform a DFS traversal of GT starting at v ′ ; the graph is strongly
connected exactly when every vertex is reached in both traversals.
Thus we can check for strongly connectedness of the graph in just two DFS
traversals i.e., in
O(|V | + |E|)
time. Why is the algorithm correct?
• The first DFS traversal on the graph G initiated at v ensures that every
other vertex is reachable from v.
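A minimal sketch of this two-traversal check (the graph is represented by explicit edge lists; vertex names are illustrative):

```python
from collections import defaultdict

def strongly_connected(vertices, edges):
    # Build G and its transpose G^T, then run one DFS on each,
    # both started from the same fixed vertex v'.
    graph, transpose = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        transpose[v].append(u)

    def reaches_all(adj, start):
        seen, stack = {start}, [start]
        while stack:                          # iterative DFS
            for w in adj[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen == set(vertices)

    v0 = next(iter(vertices))                 # fix a vertex v'
    return reaches_all(graph, v0) and reaches_all(transpose, v0)

# The two-state C/G chain: every entry of P is positive, so the FSM
# graph has all four edges and is strongly connected.
print(strongly_connected({'C', 'G'},
                         [('C', 'C'), ('C', 'G'), ('G', 'C'), ('G', 'G')]))
```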
[Figure: example state diagrams over the characters C, G, A, U and T,
illustrating strong connectivity.]
THEOREM 5.3
(Irreducible matrix theorem) A stochastic, irreducible P has a unique
solution to Equation (5.4).
The proof of this theorem is outside the scope of this book and we omit it.
Before concluding the discussion, we ask another question: Does it matter
what state we start the Markov process with? Consider the following transition
matrix:
P = [ 0  1 ]
    [ 1  0 ]
Note that P is stochastic and irreducible, hence the solution π is unique and
is given below:
[1/2 1/2] [ 0  1 ]  =  [1/2 1/2].
          [ 1  0 ]
However,
P2 = [ 0  1 ] [ 0  1 ]  =  [ 1  0 ]  = Id.
     [ 1  0 ] [ 1  0 ]     [ 0  1 ]
Thus
Pk = Id for k even, and Pk = P for k odd.
Since
[1/2 1/2]
is the unique nonnegative left eigenvector, it is clear that
lim_{k→∞} v · Pk
need not exist for an arbitrary initial distribution v (here Pk oscillates
between Id and P). When the limit does exist,
lim_{k→∞} v · Pk = π,
or, equivalently,
lim_{k→∞} Pk = 1π,
where 1 is the column vector with all entries equal to 1. In other words,
with large t, the Markov chain forgets its initial distribution and con-
verges to its stationary distribution.
We can show that the FSM graph is strongly connected, hence the matrix is
irreducible.
Another easy check is that since the matrix has no zero entry, the Markov
process is trivially irreducible and has a unique solution to Equation (5.4),
given as:
         A     C     G     T
π = [ 5/24  5/24  7/24  7/24 ].
Now, we are ready to compute the probability, P (x), associated with string
x, where
x = x0 x1 x2 . . . xn ,
for a Markov process specified by P. This is given as follows:
P (x) = π(x0 ) ∏_{i=1}^{n} P (xi | xi−1 )
      = π(x0 ) ∏_{i=1}^{n} P[xi−1 , xi ].
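As a concrete illustration, the sketch below evaluates P(x) for the two-state C/G chain shown earlier; since that matrix is symmetric, π = [1/2 1/2] is its stationary distribution (an assumption the reader can verify against Equation (5.4)):

```python
import numpy as np

states = {'C': 0, 'G': 1}
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])   # stationary distribution: pi P = pi

def string_probability(x):
    # P(x) = pi(x0) * prod_i P[x_{i-1}, x_i]
    prob = pi[states[x[0]]]
    for a, b in zip(x, x[1:]):
        prob *= P[states[a], states[b]]
    return prob

print(string_probability("CGGGCCGGGC"))
```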
[Figure: (a) a finite state machine over the states C, G, A and T; (b) its
transition probability matrix P; (c) an emission probability matrix E.]
By our convention,
P (xi |zi ) = E[zi , xi ]
and
P (zi+1 |zi ) = P[zi , zi+1 ].
Then we get the following:
P (x|z) = π(z0 ) E[z0 , x0 ] ∏_{i=1}^{n} P[zi−1 , zi ] E[zi , xi ].
Why use HMMs? The primary reason for using a Hidden Markov Model
is to model an underlying ‘hidden’ basis for seeing a particular sequence of
observation x. This basis, under the HMM model, is the path z.
2. What is
P (x|z ∗ ),
the probability of observing x, given z ∗ ?
3. What is
P (x),
the (overall) probability of observing x?
The first and the second question are obviously related and are answered
simultaneously by using the Viterbi Algorithm. The third question is left as
an exercise for the reader (see Exercise 41).
We address the first question here. The algorithm for this problem was
originally proposed by Andrew Viterbi as an error-correction scheme for noisy
digital communication links [Vit67], hence it is also sometimes called a decoder.
LEMMA 5.1
(The decoder optimal subproblem lemma) Given an HMM,
(Σ, S, P, E),
and a string,
x = x1 x2 . . . xn ,
on Σ, consider
x1 x2 . . . xi ,
the i-length prefix of x. For a state
zj ∈ S,
let the path
zj1 zj2 . . . zji (with zji = zj )
be optimal, i.e.,
zj1 zj2 . . . zji = arg max_{z1 z2 ...zi} P (x1 x2 . . . xi |z1 z2 . . . zi ).
Let
fij
be the probability associated with string
x1 x2 . . . xi ,
Recall
zj = zji .
Then the following two statements hold:
fij = max_k { f(i−1)k P (zj |zk ) } P (xi |zj )        (5.6)
and
f1j = π(zj ) P (x1 |zj ).        (5.7)
Also, let the optimal path ending at zj be written as
zj Path(zk′ ),
where
k ′ = arg max_k { f(i−1)k P (zj |zk ) P (xi |zj ) }.
In spite of its intimidating looks, the lemma is very simple and we leave
the proof as an exercise for the reader (Exercise 43). In a nutshell, the lemma
states simply that the optimal solution,
x1 x2 . . . xn ,
to the problem can be built from the optimal solutions to its subproblems
x1 x2 . . . xi .
We now describe the algorithm below based on this observation. Let x of
length n be the input string. Let Fi be a
1 × |S|
array where each
Fi [j]
stores fij of Lemma (5.1). Similarly,
P thi [j]
stores the corresponding path.
1. In the first phase, array F1 and P th1 are initialized.
2. In the second phase, where the algorithm loops, array Fi and P thi are
updated based on the observation in the lemma.
3. In the final phase, the algorithm goes over Fn [k] for each k and picks
the maximum value.
The algorithm takes
O(n |S|2 )
time. However, due to the underlying Markov process, Fi+1 depends only on
Fi and not on
Fi′ , i′ < i.
Thus only two arrays, say F0 and F1 , are adequate and the arrays can be
re-used in the algorithm. Thus the algorithm requires only
O(|S|)
extra space to run.
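A compact sketch of the decoder follows. The two-state HMM at the bottom is invented purely for illustration, and, for readability, the sketch stores full paths rather than exploiting the O(|S|) space optimization just described:

```python
import numpy as np

def viterbi(x, states, pi, P, E, alphabet):
    # Recurrences (5.6) and (5.7): f holds the current row F_i,
    # paths[j] is the best state path ending in state j.
    sym = {a: i for i, a in enumerate(alphabet)}
    m = len(states)
    f = pi * E[:, sym[x[0]]]                  # f_{1j} = pi(z_j) P(x_1|z_j)
    paths = [[j] for j in range(m)]
    for c in x[1:]:
        g, new_paths = np.empty(m), []
        for j in range(m):
            cand = f * P[:, j]                # f_{(i-1)k} P(z_j|z_k)
            k = int(np.argmax(cand))
            g[j] = cand[k] * E[j, sym[c]]     # ... times P(x_i|z_j)
            new_paths.append(paths[k] + [j])
        f, paths = g, new_paths
    best = int(np.argmax(f))                  # final phase: pick max F_n[k]
    return [states[j] for j in paths[best]], float(f[best])

z_star, prob = viterbi("CGGC", ["s0", "s1"],
                       np.array([0.5, 0.5]),
                       np.array([[0.9, 0.1], [0.2, 0.8]]),  # transitions P
                       np.array([[0.8, 0.2], [0.3, 0.7]]),  # emissions E
                       "CG")
print(z_star, prob)
```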
xi , i = 1, 2, . . . N,
For example,
P (A A T C C G) = P (T C A A C G)
               = P (T G A A C C)
               = P (A C T A C G)
               = P (A C C A T G)
               = . . .
               = (prA )2 prT (prC )2 prG .
5.7 Conclusion
What scheme best approximates a DNA fragment? What scheme best
approximates a protein sequence? These are difficult questions to answer
satisfactorily. It is best to understand the problem at hand and use a scheme
that is simple enough to be tractable and complex enough to be realistic.
Of course, it is unclear whether a more complex scheme is indeed a closer
approximation to reality. The design of an appropriate model for a biological
sequence continues to be a hot area of research.
5.8 Exercises
Exercise 35 (Constructive model) Let
and let
r0 = 0,
rj = rj−1 + 1000 praj , for each 0 < j ≤ N .
rj−1 ≤ r < rj .
(a) for a fixed position k, the probability of s[k] taking the value σ is
1/|Σ|,
for each σ ∈ Σ.
(b) for a fixed σ ∈ Σ, the probability of position k taking the value σ is
1/|Σ|,
for each k.
Hint: Note that the sum of two random numbers is random and the product
of two random numbers is random. Show that a random string/permutation
produced by the second model satisfies the properties of the first model.
(Σ, S, P, E),
show that
∑_i P (xi ) = 1,
where xi is any observed string of fixed length, say k, and P (xi ) is the
marginal probability of xi , i.e.,
P (xi ) = ∑_j P (xi |zj ),
(Σ, S, P, E).
Let Ω be the space of all strings on Σ of length k. Show that Kolmogorov’s
conditions hold for this probability distribution.
1. P (ω ∈ Ω) ≥ 0 and
2. P (Ω) = 1.
x = x1 x2 x3 . . . xn
(i.e. a sequence of emitted symbols from Σ). What is the (overall) probability
of observing x under this model?
Hint: Note that there are many HMM paths of length n that emit x. How
many such paths exist?
Exercise 43 Prove Eqns (5.6) and (5.7) in The Decoder Optimal Subproblem
Lemma (5.1).
ai , 1 ≤ i ≤ 20.
Let
#(ai aj )
denote the number of times the substring ai aj is observed in D.
For each i, the following is computed:
ni = ∑_j #(ai aj ),
n = ∑_i ni ,
P[i, j] = #(ai aj ) / ni .
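A sketch of this estimation (the two input strings are hypothetical; rows with ni = 0 are left at zero rather than normalized):

```python
from collections import Counter

def transition_matrix(sequences):
    # Estimate P[i, j] = #(a_i a_j) / n_i from observed bigram counts.
    bigrams = Counter()
    for s in sequences:
        bigrams.update(zip(s, s[1:]))
    symbols = sorted({a for pair in bigrams for a in pair})
    n_i = {a: sum(c for (x, _), c in bigrams.items() if x == a)
           for a in symbols}
    return {a: {b: (bigrams[(a, b)] / n_i[a] if n_i[a] else 0.0)
                for b in symbols}
            for a in symbols}

P = transition_matrix(["ACCGT", "ACGTT"])
print(P['A']['C'])   # fraction of times A is followed by C
```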
6.1 Introduction
One dimensional data is about the simplest organization of information.
It is amazing that deoxyribonucleic acid (DNA) that contains the genetic in-
structions for the entire development and functioning of living organisms, even
as complex as humans, is only linear. The genome, also called the blueprint of
the organism, encodes the instructions to build other components of the cell,
such as proteins and RNA molecules that eventually make up ‘life’.
As we have seen in Chapter 5 the biopolymers can be modeled as strings
(with a left-to-right direction). In this chapter we explore the problem of spec-
ifying string patterns. We begin with a few examples of patterns in biological
sequences.
Solid patterns. The ROX1 gene encodes a repressor of the hypoxic func-
tions of the yeast Saccharomyces cerevisiae [BLZ93]. The DNA binding motif
is recognized as follows:
CCATTGTTCTC
This pattern can also be extracted from DNA sequences of the transcriptional
factor ROX1.
Rigid patterns (with wild cards). One of the yeast genes required for
growth on galactose is called GAL4. Each of the UAS genes contains one or
more copies of a related 17-bp sequence called UASGAL,
G.CAAAA.CCGC.GGCGG.A.T
C..PF.[FYWI].......C-(8,10)WC....[DNSR][FYW]-(3,5)[FYW].[FYWI]C
6.2 Notation
Let the input s be a string of length n over a finite discrete alphabet
Σ = {σ1 , σ2 , . . . , σL },
s[i..j], 1 ≤ i ≤ j ≤ n,
is the string obtained by yanking out the segment from index i to j from s.
A character from the alphabet,
σ ∈ Σ,
is called a solid character. A wild card or a ‘dont care’ is denoted as the ‘.’
character. This is to be interpreted as any character from the alphabet at
that position in the pattern. The size of the pattern is given as
|p |,
This is a brief introduction and the details are to be presented for each class
of patterns in their respective sections. For a fixed alignment (j1 , j2 ), we use
the following terminology.
1. p is usually defined by removing the leading and trailing dont care el-
ements, so that its leftmost and rightmost element is a solid character.
Also, let |p |> 1. In the rest, we assume p is nontrivial.
2. p is called the meet of q1 and q2 with alignment (j1 , j2 ). Also,
p ⪯ q1 and p ⪯ q2 .
|Lp | = k,
we also say that the pattern p has a support of k in s. For any pattern p,
|Lp |≥ 2.
1 ≤ j ≤ k ≤ n − |p |,
such that
p = s[j..k].
The solid pattern is also known as an l-mer in the literature, where |p | = l.
Consider the example shown in Figure 6.1 which gives P and the location
list Lp , for each p ∈ P . It is not difficult to see that the number of nontrivial
solid patterns is
O(n2 ).
We next define maximal patterns, which help remove some repetitive patterns
in P .
s1 = a b c d a b c d a b c a b.
a b c d a b c, b c d a b c, c d a b c, d a b c,
a b c d a b, b c d a b, c d a b, d a b,
a b c d a, b c d a, c d a, d a,
P = .
a b c d, b c d, c d,
a b c, b c,
a b.
La b c d a b c = La b c d a b = La b c d a = La b c d = {1, 5}.
Lb c d a b c = Lb c d a b = Lb c d a = Lb c d = {2, 6}.
Lc d a b c = Lc d a b = Lc d a = Lc d = {3, 7}.
Ld a b c = Ld a b = Ld a = {4, 8}.
La b c = {1, 5, 9}.
Lb c = {2, 6, 10}.
La b = {1, 5, 9, 12}.
6.3.1 Maximality
A solid pattern p is maximal in length if p cannot be extended
to the left or to the right, without decreasing its support, to obtain a nontrivial
p′ ≠ p.
a b c d a b, a b c d a, a b c d,
are not maximal in s1 since each can be extended to the right with the same
support (i.e., k = 2). By this definition, what are the maximal patterns? Let
$ ∉ Σ.
We terminate s with $ as
s$,
1 An alternative proof is from [BBE+ 85]: It descends from the observation that for any two
substrings p1 and p2 of s, if
Lp1 ∩ Lp2 ≠ ∅,
then p1 is a prefix of p2 or vice versa.
[Figure 6.2: the suffix tree T (s1 ) of s1 = a b c d a b c d a b c a b $, with
edge labels such as ab, dabc, ab$ and dabcab$, and the 13 leaves labeled by
the starting positions 1, 2, . . . , 13 of the corresponding suffixes.]
for reasons that will soon become obvious. This is best explained with a
concrete example. A suffix of s, sufi (1 ≤ i ≤ n), is defined as
sufi = s[i..n].
s1 = a b c d a b c d a b c a b $.
T (s1 ),
is a compact trie of all the 13 suffixes and is shown in Figure 6.2. The direction
of each edge is downwards. It has three kinds of nodes.
1. The square node is the root node. It has no incoming edges.
2. The internal nodes are shown as solid circles. Each has a unique incom-
ing edge and multiple outgoing edges.
3. The leaf nodes are shown as hollow circles. A leaf node has no outgoing
edges.
The edges of T (s) are labeled by nonempty substrings of s and the tree satisfies
the following properties.
1. The tree has exactly n leaf nodes, labeled bijectively by the integers
1, 2, . . . , n.
The first property follows from the definition of the suffix tree and we have
seen the second property in Chapter 2. We now make the crucial observation
about the suffix tree.
THEOREM 6.1
(Suffix tree theorem) Given s,
p ∈ Pmaximal (s)
can be obtained by reading off the labels of a path from the root node to an
internal node in T (s).
Let
pth(v)
denote the string which is the label on the path from the root node to v.
Each maximal pattern must be a prefix of some suffix,
sufi , 1 ≤ i < n,
hence it can be read off from the (prefix of the) label of a path from the root
node.
Let p be a maximal motif. If
p = pth(v),
then v cannot be a leaf node since then the motif must end in the terminating
symbol $.
Assume the contrary, that there does not exist any internal node v such
that
p = pth(v).
Hence p must end in the middle of an edge label, say r, where the edge is
incident on some internal node v ′ . Let
r = r1 r2 ,
where r1 is the portion of the label covered by p. Then every occurrence of
p in s is followed by r2 , so p can be extended to p r2 = pth(v ′ ) without
decreasing its support, contradicting the maximality of p. Hence
p = pth(v)
for some internal node v.
p ∈ Pmaximal (s1 )
p1 = b c,
p2 = b c d a b c,
p3 = c d a b c,
p4 = d a b c,
and
p1 , p2 , p3 , p4 ∉ Pmaximal (s1 ).
Notice in Figure 6.2 that for each pi , 1 ≤ i ≤ 4, there exists some node vi in
T (s1 ) such that
pth(vi ) = pi .
Thus the pattern corresponding to
p = pth(v),
for an arbitrary internal node v, is not necessarily maximal. Since we only
need an upper bound on |Pmaximal (s)|, this is not a concern.
THEOREM 6.2
(Maximal solid pattern theorem) Given a string s of size n, the number
of maximal solid patterns in s is no more than n, i.e.,
|Pmaximal (s)| ≤ n.
This directly follows from Property (2) of suffix trees and Theorem (6.1).
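For intuition, here is a brute-force enumeration of the maximal solid patterns; a suffix tree achieves the same in linear time, so this sketch is only meant to make the definition concrete:

```python
from collections import defaultdict

def maximal_solid_patterns(s):
    # A nontrivial solid pattern with support >= 2 is maximal in length
    # if no one-character extension keeps its full location list.
    locs = defaultdict(list)
    n = len(s)
    for i in range(n):
        for j in range(i + 2, n + 1):          # nontrivial: length >= 2
            locs[s[i:j]].append(i + 1)         # 1-based positions
    alphabet = set(s)
    result = {}
    for p, L in locs.items():
        if len(L) < 2:
            continue
        right = any(len(locs.get(p + c, ())) == len(L) for c in alphabet)
        left = any(len(locs.get(c + p, ())) == len(L) for c in alphabet)
        if not (right or left):
            result[p] = L
    return result

# s1 from the running example: the maximal patterns are ab, abc, abcdabc.
print(maximal_solid_patterns("abcdabcdabcab"))
```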
Σ + ‘.’
The pattern is called rigid since the length it covers in all its occurrences is the
same (in the next section we study patterns that occupy different lengths).
Next, we define the occurrence of a rigid motif. We first define the following.
Let
σ1 , σ2 ∈ Σ.
Then we define a partial order relation (⪯) as follows.
(σ1 = ‘.’) ⇔ σ1 ≺ σ2 .
(σ1 ≺ σ2 ) OR (σ1 = σ2 ) ⇔ σ1 ⪯ σ2 .
p[l] ⪯ s[j + l − 1]
holds for each 1 ≤ l ≤ |p |; then p occurs at the interval
[j, j + |p | −1]
of s. Similarly, if
|p |= |q|,
we say
p ⪯ q,
if for each l,
p[l] ⪯ q[l].
s1 = a b c d a b c d a b c a b.
Let P (s1 ) be the set of all nontrivial rigid patterns and we consider a subset,
P ′ (s1 ), as shown below.
            c d a b c, c d a b, c d a, c d,
            c . a b c, c . a b, c . a,
            c d . b c, c d . b,
            c d a . c, c . . b,
P ′ (s1 ) =  c . . b c,              ⊂ P (s1 ).
            c d . . c,
            c . a . c,
            c . . . c
Each p ∈ P ′ (s1 ) has
Lp = {3, 7}.
Conversely, if q is a rigid pattern with
Lq = {3, 7},
then
q ∈ P ′ (s1 ).
p size Lp
|←− ℓ −→|
c c c c ··· c ccc ℓ {1, ℓ + 2}
c c c ··· c c c c ℓ−1 {1, 2, ℓ + 2, ℓ + 3}
c c ··· c c c c ℓ−2 {1, 2, 3, ℓ + 2, ℓ + 3, ℓ + 4}
c ··· c c c c ℓ−3 {1, 2, 3, 4, ℓ + 2, ℓ + 3, ℓ + 4, ℓ + 5}
.. .. ..
. . .
c ccc 4 {1, 2, . . . , ℓ − 3, ℓ + 2, ℓ + 3, . . . , 2ℓ − 3}
ccc 3 {1, 2, . . . , ℓ − 2, ℓ + 2, ℓ + 3, . . . , 2ℓ − 2}
cc 2 {1, 2, . . . , ℓ − 1, ℓ + 2, ℓ + 3, . . . , 2ℓ − 1}
n = 2ℓ + 1,
ℓ − 2.
p Lp i
|←− ℓ −→|
c . c c ··· c c c c {1, ℓ} 2
c c . c · · · c c c c {1, ℓ − 1} 3
c c c . · · · c c c c {1, ℓ − 2} 4
.. .. ..
. . .
c c c c ··· . c c c {1, 5} ℓ − 3
c c c c ··· c . c c {1, 4} ℓ − 2
c c c c ··· c c . c {1, 3} ℓ − 1
FIGURE 6.4: The longest patterns with exactly one ‘.’ character at posi-
tion i in the pattern in s3 . Each is of length ℓ and is maximal in composition
but not in length.
It is interesting to note that the longest pattern with exactly one ‘.’ character
is not maximal. Each pattern listed above is maximal in composition but not
maximal in length. In fact, there is no maximal pattern with exactly one ‘.’
character in s3 .
Figure 6.5 shows the maximal patterns with exactly two ‘.’ elements at
positions i and j of the pattern. The maximal pattern can be written as p2i ,
which has an ‘.’ character at position i and position ℓ + 1 of the pattern. p2i
is of length ℓ + i − 1, and
Lp2i = {1, ℓ − (i − 2)}.
The number of such maximal patterns is ℓ − 1.
Figure 6.7 shows a few examples of patterns with more than two ‘.’ elements
and each of these can be constructed from the patterns of Figure 6.5. It can
be verified that the number of such maximal patterns with j + 1 ‘.’ elements
is
(ℓ−1 choose j).
Thus, including the maximal patterns with no ‘.’ elements and exactly two ‘.’
elements, the total number of maximal patterns is given as:
2(ℓ − 1) + ∑_{j=3}^{ℓ−3} (ℓ−1 choose j).
Note that
ℓ ≈ n/2,
thus the number of maximal patterns is
O(2n ).
p                                          Lp          i, j
|←− ℓ + 2 −→|
c . c ··· c . c                            {1, ℓ}      2, ℓ + 1
|←− ℓ + 3 −→|
c c . c ··· c . c c                        {1, ℓ − 1}  3, ℓ + 1
. . .
|←− 2ℓ − 2 −→|
c c c ··· c . c c c . c ··· c c c          {1, 4}      ℓ − 2, ℓ + 1
|←− 2ℓ − 1 −→|
c c c c ··· c c . c . c c ··· c c c c      {1, 3}      ℓ − 1, ℓ + 1
|←− 2ℓ −→|
c c c c ··· c c c . . c c c ··· c c c c    {1, 2}      ℓ, ℓ + 1
FIGURE 6.5: All maximal patterns in s3 with two ‘.’ elements at positions
i and j of the pattern. Note that the length of each pattern is > ℓ + 1.
[FIGURE 6.6: The ℓ + 1 maximal patterns with two ‘.’ elements (shown
as •) on s3 , stacked in two different ways (left flushed and right flushed) to
reveal the ‘pattern’ of their arrangement within the maximal patterns.]
[FIGURE 6.7: Some maximal patterns in s3 with three ‘.’ elements, at
positions i, j and k of the pattern; for example, the pattern of length ℓ + 2
with ‘.’ at positions 2, ℓ and ℓ + 1 (c . c c ··· c . . c) has Lp = {1, ℓ − 1, ℓ},
and the pattern of length ℓ + 2 with ‘.’ at positions 2, 3 and ℓ + 1
(c . . c ··· c c . c) has Lp = {1, 2, ℓ}.]
p1 = c . . c c c . c   ×
p2 = c . c . c . . c   ×
p3 = c . . . . . . c   ×
p4 = c . c . c . c c   ✓
p5 = c c c c c c c c   ✓
s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|
s4 = c a c a c a · · · c a c a c a g c a c a c a · · · c a c a c a
|←− 2ℓ −→| |←− 2ℓ −→|
x ∈ Σ + ‘.′
with
x a.
For example, p1 and p2 are maximal motifs in s3 below. Notice that p′1 and
p′2 meet the density constraint (d = 1).
The proof of the claim is straightforward and we leave that as an exercise for
the reader (Exercise 54). Thus the number of maximal density-constrained
(d = 1) rigid patterns in s4 is also
O(2n ).
|Lp | ≥ k.
Although this filters out patterns that occur fewer than k times, it is not
sufficient to bring the count of the patterns down. We next show that even
for the quorum-constrained rigid patterns, the number of maximal motifs can
be very large.
Consider the input string s3 of Section 6.4.1. The number of quorum con-
strained maximal motifs can be verified to be at least:
(ℓ − k + 1) + ∑_{j=k}^{ℓ−3} (ℓ−1 choose j).
The patterns can be constrained to meet both the density and quorum
requirements, yet the number of maximal patterns could be very large. This
is demonstrated by calculating the number of patterns on s4 . We leave this
as an exercise for the reader.
{c, a}
and s4 ’s alphabet is
{c, a, g}.
Both sets are fairly small. Is it possible that the number of maximal patterns
stays small if the alphabet is large, say,
|Σ| = O(n)?
Consider
Σ = {e, 0, 1, 2, . . . , ℓ−1, ℓ}.
For convenience, denote
ℓi = ℓ − i.
Next we construct s5 and for convenience display the string in ℓ rows as
follows:
s5 = 0 1 2···3 ℓ3 ℓ2 ℓ1 ℓ
0 e 2···3 ℓ3 ℓ2 ℓ1 ℓ
0 1 e···3 ℓ3 ℓ2 ℓ1 ℓ
0 1 2···e ℓ3 ℓ2 ℓ1 ℓ
..
.
0 1 2 3 ··· e ℓ2 ℓ1 ℓ
0 1 2 3 ··· ℓ3 e ℓ1 ℓ
0 1 2 3 ··· ℓ3 ℓ2 e ℓ
D ⊆ {1, 2, . . . , ℓ1 } and
q = 0 1 2 3 · · · ℓ3 ℓ2 ℓ1 ℓ
q(D1 ) ≠ q(D2 ).
pD [1 . . . 1 + ℓ] = q(D).
(ℓ + 1)(j − 1) + (j + 1) ∈ LpD .
(c)
|LpD | = |D| + 1.
3. The number of distinct such D’s is
O(2ℓ−1 ).
To understand these claims, consider the case where ℓ = 5. Then s′5 (from
s5 ) is constructed as below:
s′5 = 0 1 2 3 4 5 0 e 2 3 4 5 0 1 e 3 4 5 0 1 2 e 4 5 0 1 2 3 e 5
↑ ↑ ↑ ↑ ↑
i= 1 7 13 19 25
Then all possible D sets and the corresponding q(D) are shown below. The
maximal pattern pD and its location list LpD for four cases are:
q(D)           pD                          LpD
0 1 2 . 4 5    0 1 2 . 4 5 0 . 2 3 . 5     {1, 19}
0 1 . . 4 5    0 1 . . 4 5 0 . 2 . . 5     {1, 13, 19}
0 . . 3 . 5    0 . . 3 . 5                 {1, 7, 13, 25}
0 . . . . 5    0 . . . . 5                 {1, 7, 13, 19, 25}
We leave the proof of the claims for a general ℓ as an exercise for the reader
(Exercise 55).
Recall that ℓ = √n. Thus the number of maximal rigid patterns in s5 is
O(2^{√n}).
and for
p ≠ pi ∈ Pmaximal (s), i = 1, 2, . . . l,
the support of p is obtained from the support of the pi ’s, i.e.,
However, there are some details hidden in the notation. We bring these out
in the following concrete example. For some input s, let
p0 , p1 , p2 , p3 , p4 ∈ Pmaximal (s).
p1 G T T . G A
p2 G G G T G G A C C C
p3 GT . GACC
p4 GTTGAC
p0 . . . T . G A . . .
L + j = {i + j | i ∈ L}.
Lp0 = (Lp1 + 3) ∪ (Lp2 + 4) ∪ (Lp3 + 2) ∪ (Lp4 + 2).
Next, we state the central theorem of this section [AP04]. We begin with a
definition that the theorem uses. Given an input s of length n, for 1 < i < n,
let ari be defined as follows:2
ari = s ⊗ s, for alignment (1, i).
THEOREM 6.3
(Basis theorem) [AP04] Given an input s of length n,
Pbasis (s) = {ari | 1 < i < n},
and hence |Pbasis (s)| < n.
THEOREM 6.4
(Density-constrained basis theorem) Given s of length n, let
P^{(d)}_{basis}(s)
denote the basis for the collection of rigid patterns that satisfy some density
constraint d > 0. Then
|P^{(d)}_{basis}(s)| < n2 .
An informal argument is as follows. Each ari , 1 < i < n, may get frag-
mented at regions where the density constraint is not met. For example, let
density constraint d = 2 and consider
ari = c a c . . g . . . c c . c . c a t . . . c t.
ari1 = c a c . . g
ari2 = c c . c . c a t
ari3 = c t
The number of such fragments for each ari is no more than n. Hence the
size of the basis is
O(n2 ).
THEOREM 6.5
(Quorum-constrained basis theorem) [AP04, PCGS05] Given s, let
P^{(k)}_{basis}(s)
denote the basis for the collection of rigid patterns that satisfy quorum k (> 1).
Then:
P^{(k)}_{basis}(s) = { s ⊗ s ⊗ . . . ⊗ s, for alignment (1, i2 , i3 , . . . , ik ),
                      with 1 < i2 < . . . < ik < n }.
Thus
|P^{(k)}_{basis}(s)| < (n − 1)k−1 .
This follows from the proof of Theorem (6.3). However, to show that such a
bound is actually attained, consider the concrete example of s3 of Section 6.4.1
and quorum k = 3. See Figure 6.7 for some maximal patterns p with
|Lp | = 3.
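The meet operation that generates these basis patterns can be sketched directly: column by column, aligned copies of s that agree keep their character, and disagreeing columns become dont cares:

```python
def meet(s, offsets):
    # Meet of copies of s at the alignment given by the 1-based offsets
    # (1, i2, ..., ik): keep a column's character when all copies agree,
    # otherwise write a dont care; trim dont cares at the two ends.
    k = min(len(s) - (o - 1) for o in offsets)   # length of the overlap
    cols = []
    for j in range(k):
        chars = {s[o - 1 + j] for o in offsets}
        cols.append(chars.pop() if len(chars) == 1 else '.')
    return ''.join(cols).strip('.')

# ar_i = s (x) s at alignment (1, i), e.g. on s3 with l = 4:
s3 = "cccc" + "a" + "cccc"
print(meet(s3, (1, 2)))     # a basis pattern of s3
print(meet(s3, (1, 2, 6)))  # a quorum-3 meet s (x) s (x) s
```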
6.6 Generalizations
Here we discuss two simple and straightforward generalizations of string
patterns: one where an element of the input is replaced by a set of elements
(called homologous sets) and the second where the input is a sequence of real
values.
Note that
Σ = {A, L, G, T}.
The input is interpreted as follows.
Thus each element, s[j], is a set for 1 ≤ j ≤ n. Usually, only a certain subset of
elements can appear at a position and they are called homologous characters.
For example, some homologous groups of amino acids are shown below:
[L I V M]
[L I]
[F Y W]
[A S G]
[A T D]
x ⪯ y ⇔ (x ⊆ y),
and
‘.’ ⪯ x.
Following the convention that, i ∈ Lp , is the leftmost position (index) in s6
of an occurrence of solid motif
In fact there are uncountably many patterns with this location list. To
circumvent this problem, we allow the patterns to draw their alphabets not
from real numbers but from closed real intervals. For example, in this case
the first character is replaced by the interval
(0.69, 0.81)
and the unique pattern corresponding to this location list is:
p = (0.69, 0.81) • 2.2, Lp = {1, 4, 7}.
The partial order on an interval, (x, y), is defined naturally as follows (x1 , x2 ,
y1 , y2 are reals):
((x1 , x2 ) ⪯ (y1 , y2 )) ⇔ ((x =r y) for each x1 ≤ x ≤ x2 and y1 ≤ y ≤ y2 ) .
Thus for i = 1, . . . , 5,
pi ⪯ p and Lp = Lpi .
In other words each pi is redundant (or nonmaximal) w.r.t. p.
[FIGURE 6.8: The input real sequence s = 7 8 10 4 1 6 is shown in (1).
For δ = 3: (2a)-(2b) show the mapping of the reals on the real line to the
discrete alphabet (1 ↔ {a}, 4 ↔ {b,c}, 6 ↔ {c,d,e}, 7 ↔ {d,e,f}, 8 ↔ {e,f,g},
10 ↔ {g,h}); (2c) shows the string s3 = [d e f] [e f g] [g h] [b c] a [c d e] on
the discrete alphabet; (2d) shows the mappings of the discrete alphabet to
real (closed) intervals (a ↔ (−0.5, 2.5), . . . , h ↔ (9.5, 11.5)). For δ = 2, the
same steps are shown in (3a)-(3d), giving s2 = [d e] [e f] g b a [c d] and the
intervals a ↔ (0, 2), . . . , g ↔ (9, 11).]
6.7 Exercises
Exercise 46 Consider the solid pattern definition of Section 6.3. Does there
exist a string s, with
|s| > 1,
such that the root node in its suffix tree has exactly one child?
Hint: [Figure: a suffix tree that is a chain of ‘a’ edges, with a ‘$’ leaf
branching off at each level.]
2. For each node v in the suffix tree T (s), characterize the set {pth(v)}.
Note that
Pmaximal (s) ⊂ {pth(v)}.
Hint: 1. What if sufi = s[j..k] holds for some 1 ≤ j < k ≤ n? 2. Is it the
collection of all nonmaximal patterns?
Exercise 49 In Section 6.3, we saw that the number of maximal solid pat-
terns, Pmaximal (s), in s of length n satisfies |Pmaximal (s)| ≤ n. Construct a
string s′ of length n with
|Pmaximal (s′ )| = n − 1.
Exercise 50 Consider the rigid pattern definition of Section 6.4. Show that
the maximality of a rigid pattern as defined in this section is equivalent to the
definition of maximality of Chapter 4.
Exercise 51 Let
s = a b c d a b c d a b c a b.
We follow the convention that, i ∈ Lp , is the leftmost position (index) in s1
of an occurrence of p. Enumerate all the rigid patterns p in s such that
1. Lp = {1, 5},
2. Lp = {1, 5, 9},
3. Lp = {1, 5, 9, 12},
4. Lp = {2, 6},
5. Lp = {2, 6, 10},
6. Lp = {3, 7},
7. Lp = {4, 8}.
s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|
4. Show that there is no nontrivial maximal pattern with only one ‘.’ char-
acter.
Exercise 53 Consider the input string s3 of Section 6.4.1. Let the patterns
satisfy quorum k. Show that the number of quorum constrained maximal mo-
tifs is at least:
(ℓ − k + 1) + ∑_{j=k}^{ℓ−3} (ℓ−1 choose j).
Hint: The number of patterns without dont cares that meet the quorum con-
straint is ℓ − k + 1.
s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|
s4 = c a c a c a · · · c a c a c a g c a c a c a · · · c a c a c a
|←− 2ℓ −→| |←− 2ℓ −→|
Further, given p, p′ is constructed from p by replacing each element
x ∈ Σ + ‘.′
of p with
x a.
Then show that the following statements hold.
p2 =  c c c c ··· c c c c . . . . . ··· . . . . . c
      |←− ℓ − 1 −→|       |←−−− ℓ −−−→|
p′2 = c a c a c a ··· c a c a c a . a . a . a ··· . a . a . a c a
      |←− 2ℓ − 2 −→|              |←−−− 2ℓ −−−→|
O(2n ).
Hint: 5. Use the fact that the number of maximal patterns in s3 is exponen-
tial.
D ≠ ∅.
O(2ℓ−1 ).
(a)
1 ∈ LpD .
(ℓ + 1)(j − 1) + (j + 1) ∈ LpD .
(c)
|LpD | = |D| + 1.
s8 = c c c · · · c c c a c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|
s9 = c c c · · · c c c a c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→| |←− ℓ −→|
Hint: What are patterns with 0,1,2, .... dont care elements? Since the
patterns are maximal, consider the autocorrelations of the input sequences.
0 ≤ δ1 ≤ δ2 .
3. Show that if
[σ1 , σ2 , . . . , σm ]
is a homologous set resulting from the construction with the mappings:
σ1 ↔ (l1 , u1 )
σ2 ↔ (l2 , u2 )
..
.
σm ↔ (lm , um )
u1 = l 2 ,
u2 = l 3 ,
..
.
um−1 = lm .
[σ1 , σ2 , . . . , σm ] ↔ (l1 , um ).
[a b] [b c d] [d c e] [e f g] [g h i] [i j].
2.
If {p1 , p2 , . . . , pl } ֒→ p0 , then p0 ⪯ pi , i = 1, 2, . . . , l.
p1 = G T T G G A
p2 = G G G T G G A C C C
p3 = G T . G A C C
p4 = G T T G A C
Four meet operations with the alignments are shown below (see Section 6.4.6):
q1 = p2 ⊗ p3 ⊗ p4                    q2 = p2 ⊗ p3
p2   G G G T G G A C C C             p2   G G G T G G A C C C
p3       G T . G A C C               p3       G T . G A C C
p4       G T T G A C
q1   . . G T . G A C . .             q2   . . G T . G A C C .

q3 = p1 ⊗ p2 ⊗ p3 ⊗ p4               q4 = p1 ⊗ p2 ⊗ p3 ⊗ p4
p1     G T T G G A                   p1       G T T G G A
p2   G G G T G G A C C C             p2   G G G T G G A C C C
p3       G T . G A C C               p3       G T . G A C C
p4       G T T G A C                 p4       G T T G A C
q3   . . . T . G A . . .             q4   . . G T . G . . . .
4. Show that
Pbasis = {ari | 1 < i < n}.
Hint: 1. Use proof by contradiction. 2. See Exercise 62. 3. Follows from the
definition of maximality and the meet operator ⊗. Show the following.
p = ari2−i1+1 = s ⊗ s, for alignment (1, i2 − i1 + 1).
p = s ⊗ s ⊗ . . . ⊗ s, for alignment (1, i2 − i1 + 1, i3 − i1 + 1, . . . , ik − i1 + 1)
  = ari2 ⊗ ari3 ⊗ . . . ⊗ arik , for alignment (1, 1, . . . , 1).
s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|
2. |Pbasis | = 2ℓ − 2.
Hint: s3 ’s basis is the stack of solid patterns
c c
c c c
c c c c
. . .
c c c ··· c c c
together with the patterns containing two ‘.’ elements,
c . c c c ··· c . c
c c . c c ··· c . c c
c c c . c ··· c . c c c
. . .
c c c ··· c . c c . c c ··· c
c c c ··· c c . c . c c ··· c c
c c c ··· c c c . . c c ··· c c c
|←− ℓ − 1 −→| 2 |←− ℓ − 1 −→|
O(n2 ).
See [AP04] for an efficient incremental algorithm. See also [PCGS05] for a
nice exposition.
s = A T C G A T A.
p=A.C−A?
ATCGA TA
ATCGATA
|Lp′ | ≥ 2.
Hint: Let
Lp = {i1 , i2 , . . . , ik−1 , ik }.
1. Let
p′ = p−p
Then
Lp′ = {i1 , i2 , . . . , ik−1 }.
2. Let
p′ = p−p−p
Then
Lp′ = {i1 , i2 , . . . , ik−2 }.
And, so on.
Hint: Let the quorum be k(≥ 2). If some pi is such that it occurs less than
k times and this pi is used in more than one occurrence of p.
Hint:
p1 = A G A − C T A A − A . G − A
p2 = G − − − T G A A A − − A A C . G
p= G−−−T . A−A−−−A
s = A C G G T T C,
then
s̄ = C T T G G C A.
jc = j + (|p | − 1)/2.
Note that jc may not always be an integer, but that does not matter. We
follow the convention that, i ∈ Lp , is the center of the occurrence of p
in s. If p is a pattern in s, then show the following:
|P (s)| = |P (s̄)|,
|Pmaximal (s)| = |Pmaximal (s̄)|,
|Pbasis (s)| = |Pbasis (s̄)|.
Comments
String patterns is about the simplest idea, in terms of its definition, in
bioinformatics. It is humbling to realize how complicated the implications of
simplicity can be. This area has gained a lot from a vibrant field of research
in computer science, called stringology (not to be confused with string theory,
from high energy physics).
Chapter 7
Algorithms & Pattern Statistics
7.1 Introduction
In the previous chapter, we described a whole array of possible character-
izations of patterns, starting from the simple l-mer (solid patterns) to rigid
patterns with dont care characters to extensible patterns with variable length
gaps. Further, the element of a pattern could be drawn from homologous sets
(multi-sets). In this chapter we take these intuitive definitions to fruition by
designing practical discovery algorithms and devising measures to evaluate
the significance of the results.
Input: The input is a string s of size n and two positive integers, density
constraint d and quorum k > 1.
Output: The density (or extensibility) parameter d is interpreted as the
maximum size of the gap between two consecutive solid characters in a pat-
tern. The output is all maximal extensible patterns that occur at least k times
in s.
The algorithm can be adapted to extract rigid motifs as a special case. For
this, it suffices to interpret d as the maximum number of dot characters
between two consecutive solid characters.
Σ + ‘.’+‘-’
⟨σ1 , σ2 , ℓ⟩ ,
is defined as follows:
1. p̂ begins in σ1 and ends in σ2 where
σ1 , σ2 ⊆ Σ.
p = A G . . C − T . [G C]
p̂            ⟨σ1 , σ2 , ℓ⟩
A G          ⟨A, G, 0⟩
G . . C      ⟨G, C, 2⟩
C − T        ⟨C, T, −1⟩
T . [G C]    ⟨T, [G C], 1⟩
This will be also used later in the probability computations of the patterns.
Initialization Phase: The cell is the smallest extensible component of a
maximal pattern and the string can be viewed as a sequence of overlapping
cells. The initialization phase has the following steps.
Step 1: Construct patterns that have exactly two solid characters in them,
separated by no more than d spaces or ‘.’ characters. This is done by
scanning the string s from left to right. Further, for each location, the start
and end position of the cell are also stored.
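A minimal sketch of Step 1 (positions are reported 1-based, matching the worked example below; the function name is illustrative):

```python
from collections import defaultdict

def initial_cells(s, d):
    # Step 1: scan s left to right and record every cell, i.e., two solid
    # characters separated by at most d dot characters, together with the
    # (start, end) positions of each occurrence (1-based).
    cells = defaultdict(list)
    n = len(s)
    for i in range(n - 1):
        for gap in range(d + 1):
            j = i + gap + 1
            if j >= n:
                break
            cell = s[i] + '.' * gap + s[j]
            cells[cell].append((i + 1, j + 1))
    return cells

for cell, locs in sorted(initial_cells("CAGCAGTCTC", 2).items()):
    print(cell, locs)
```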
Step 2: The extensible cells are constructed by combining all the cells with
at least one dot character and the same start and end solid characters. The
location list is updated to reflect the start and end position of each occurrence.
p = p1 ⊕ p2
= C. G . . T
Note that p2 is not compatible with p1 . Also, the location list of p is appro-
priately generated as follows:
1. Amongst all possible candidate cells, always pick the most saturated
one (at each step). This ensures that the patterns are generated in the
desirable order: If p′ is nonmaximal w.r.t. a maximal pattern p, then p
is always generated (emitted) before p′ .
The overall algorithm could either be simply iterative or recursive (to take
advantage of partial computations). We describe a recursive version below
(note that the ‘backtrack’ in the discussion can be implicitly captured by
recursive calls).
The details of the algorithm are left as an exercise for the reader (Exer-
cise 70) which can be gleaned from the concrete example discussed below.
The reader is also directed to [ACP05] for other details.
s= C A G C A G T C T C.
The following is the ordering of the cells; the cells are processed in the order displayed here.
          A G,      C A,      G C,      T C,
          A . C,    C T,      G T,      T . T,
          A . T,    C . G,    G . A,    T . . T,
B′left =  A . . A,  C . C,    G . C,    T − T,
          A . . C,  C . . C,  G . . G,
          A − C,    C . . T,  G . . T,
                    C − C,    G − C,
                    C − T,    G − T.

           C A,      G C,      A G,      G T,
           G . A,    T C,      C . G,    C T,
           A . . A,  A . C,    G . . G,  A . T,
           C . C,    T . T,
B′right =  G . C,    C . . T,
           C . . C,  G . . T,
           A . . C,  C − T,
           T . . C,  G − T,
           A − C,    T − T,
           C − C,
           G − C.
To avoid clutter, the cells that do not meet the quorum constraints have been
removed to produce Bleft and Bright . See Exercise 69 (1) for a mild warning
about this step.
         A G,     C A,     G − C,   T C,
Bleft =  A − C,   C . G,   G − T,
                  C − C,
                  C − T.

          C A,     T C,     A G,     C − T,
Bright =  A − C,   C . G,   G − T,
          C − C,
          G − C.
Note that
(i, j) ∈ L
denotes that the cell begins at position i and ends at position j in the input.
Again, to avoid clutter, we do not enumerate all the location lists of the cells.
We show only a few examples of cells with their location lists below.
We now show the steps involved in constructing the maximal extensible pat-
terns.
1. (Pick cell in order of saturation) Consider the cells that start with
A and have at least two occurrences. Then we have the following:
p1 = A G
p2 = A − C
What should the first choice be? Between the two, pattern p1 is more
‘saturated’ than p2 and p1 is picked first.
p1 = A G
We pick p3 .
q1 = p1 ⊕ p3
= A G − C , and
Lq1 = {(2, 4), (5, 8)}.
3. (Explore right) We continue to explore the right and search for cells,
p, such that q1 is compatible with p. We look in Blef t for cells that start
with C.
p5 = C A
p6 = C . G
p7 = C − C
p8 = C − T
Adding p5 , p6 , p7 does not meet the quorum requirements. The only
option is p8 .
q2 = q1 ⊕ p8
= AG−C−T , and
Lq2 = {(2, 7), (5, 9)}.
4. (Explore right) We continue to explore the right and search for cells,
p, such that q2 is compatible with p. We look in Blef t for cells that start
with T.
p9 = T C
q3 = q2 ⊕ p9
= AG−C−TC , and
L q3 = {(2, 8), (5, 10)}.
p5 = C A
Thus
q4 = p5 ⊕ q3
= CAG−C−TC , and
L q4 = {(1, 8), (4, 10)}.
q4 = C A G − C − T C
8. (Explore right) We are back in the state of step 2 and at this stage
we wish to extend
p1 = A G
to the right with
p4 = G − T.
But this does not meet the quorum constraint, so we explore the left
now.
p5 = C A
Thus
q5 = p5 ⊕ p1
= C A G , and
L q5 = {(1, 3), (4, 6)}.
11. (NO emit) Before emitting q5 , it is checked for maximality against the
emitted pattern q4 and it turns out that q5 is nonmaximal w.r.t. q4 ,
hence it cannot be emitted.
p2 = A − C
13. (Explore right) We explore the right and search for cells, p, such that
p2 is compatible with p :
p5 = C A
p6 = C . G
p7 = C − C
p8 = C − T
q6 = p2 ⊕ p8
= A−C−T , and
L q6 = {(2, 7), (5, 9)}.
14. (Explore right) We continue to explore the right and search for cells,
p, such that q6 is compatible with p:
p9 = T C
Adding p9 does not meet the quorum requirements. No more cells can
be added to the right in q6 .
15. (Explore left) Now we try to add cells to the left of q6 and look for
cells p such that p and q6 are compatible:
p5 = C A
Thus
q7 = p5 ⊕ q6
= CA−C−T , and
Lq7 = {(1, 7), (4, 9)}.
However, no more cells can be added to the right or to the left of q7 .
16. (Emit maximal pattern) q7 is checked against q4 which was emitted
before. q7 is not nonmaximal w.r.t. q4 , hence emitted.
q7 = C A − C − T
17. (Repeat iteration) In fact, we should repeat the whole process with
cells starting with C, G and T as well. But we skip those details here.
See Exercises 70 and 71 for other details on the algorithm.
Note that
∑_σ prσ = 1.
Let the number of times σ appears in q be given by
kσ .
Then the probability of occurrence of q, prq , is given as
prq = ∏_{σ∈Σ} (prσ )^{kσ} .        (7.1)
Thus, the dot character implicitly has a probability of 1. This fact alone
can raise some debate about the model, but we postpone this discussion to
Section 7.5.3.
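As a small illustration of Equation (7.1), assuming uniform character probabilities:

```python
def pr_rigid(q, pr):
    # Equation (7.1): dot characters contribute a factor of 1.
    prob = 1.0
    for c in q:
        if c != '.':
            prob *= pr[c]
    return prob

uniform = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
print(pr_rigid("C.TC", uniform))   # 0.25 ** 3
```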
Markov Chain
Next, we obtain the form of prq for a pattern q when the input is assumed
to be generated by a Markov chain (see Chapter 5). For the derivation below,
we assume the Markov chain has order 1. Let
pr^{(k)}_{σ1 ,σ2}
{1, 2, . . . , d}.
α = {2, 4, 5, 7}.
αi , 1 ≤ i ≤ e.
q = A .{1,2} C .{2,3} G,
Then
prq = ∏_{σ∈Σ} (prσ )^{kσ} · ∏_{i=1}^{e} |αi |.        (7.3)
But
|R(q)| = ∏_{i=1}^{e} |αi |.
Hence,
prq = |R(q)| ∏_{σ∈Σ} (prσ )^{kσ} .        (7.4)
Markov chain
If q is a nondegenerate extensible pattern then,
prq = ∑_{q′ ∈R(q)} prq′ .        (7.7)
When sets of characters or homologous sets are used in patterns, the cell
is appropriately defined so that σ1 and σ2 are sets of homologous characters,
possibly singletons. Then the following holds.
prq = ∑_{σ∈q[1]} ∑_{q′ ∈R(q)} πσ ∏_{⟨σ1 ,σ2 ,ℓ⟩ ∈ C(q′ )} ( ∑_{σa ∈σ1 , σb ∈σ2 } P^ℓ [σa , σb ] )        (7.9)
M 3 (q) = { C 2 G G G}.
and so on.
Using Bonferroni’s inequalities (see Chapter 3), if k is odd, then a kth order
approximation of prq is an overestimate of prq .
Using
prdot < 1,
instead of
prdot = 1,
could be interpreted as a probabilistic way to include a ‘gap penalty’ in
the previous formulation.
Patterns with
z(q) > α
or
z(q) < −α
are respectively overrepresented or underrepresented, or simply surprising.
7.6.1 z-score
Let prq be the probability of the pattern q occurring at any location i on
the input string s with
n = |s|
and let kq be the observed number of times it occurs on s. Assuming that the
occurrence of a pattern p at a site is an i.i.d. process, ([Wat95], Chapter 12),
for large n and kq ≪ n,
(kq − n prq ) / √(n prq (1 − prq )) → Normal(0, 1).
See Chapter 3 for properties of normal distributions. Thus the z-score for a
pattern q is given as
z(q) = (kq − n prq ) / √(n prq (1 − prq )).        (7.13)
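Equation (7.13) transcribes directly; the example call below uses made-up numbers:

```python
import math

def z_score(k_obs, n, pr_q):
    # Equation (7.13): standardized deviation of the observed count
    # from its expectation n * pr_q under the i.i.d. site model.
    return (k_obs - n * pr_q) / math.sqrt(n * pr_q * (1.0 - pr_q))

print(z_score(3, 10_000, 0.25 ** 6))
```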
s1 , s2 , ..., st ,
Let kq be the observed number of sequences that contain q. Then the statisti-
cal significance of a given discrepancy between the observed and the estimated
is assessed by taking the χ-square ratio as follows:
χ(q) = (kq − ke )2 / ke .
and
kq1 = kq2 ,
where kq1 is the observed frequency of q1 and kq2 is that of q2 . Further let,
prq1 , prq2 < 1/2.
Then the z-scores of the two patterns satisfy the following [ACP05]:
z(q1 ) ≥ z(q2 ).
1 7,60E+07 RA.T[LV].C.P-(2,3)G.HP....AC[ATD].L....[ASG]
2 21416,8 A..[LV].C.P-(2,3)G.HP-(1,2,4)[ASG].[ATD]
3 8105,33 A-(1,4)T....P-(2,3)G.HP....[ATD]-(3)L....[ASG]
4 5841,85 [ATD].T....P-(1,2,3)G.HP-(1,2,4)A.[ATD]
5 4707,62 P.[ASG]-(2,3,4)P....AC[ATD].L....[ASG]
6 4409,21 A..[LV]...P-(2,3)G.HP-(1,2,4)A.[ATD]
7 3086,17 P-(1,2,3)[ASG]..P-(4)AC[ATD].L....[ASG]
8 3068,18 R..[ATD]....P-(2,3)G.HP-(1,2,4)[ASG].[ATD]
9 2615,98 [ASG][ATD]-(1,3,4)P....AC[ATD].L....[ASG]
10 2569,66 [ASG]-(1,2,3,4)P....AC[ATD].L....[ASG]
11 2145,6 G-(2,3)P....AC[ATD].L....[ASG]
FIGURE 7.1: The functionally relevant motif is shown in bold for Strep-
tomyces subtilisin-type inhibitors signature (id PS00999). Here 20 sequences
of about 2500 bases were analyzed.
7.7 Applications
We conclude the chapter by showing some results on protein and DNA se-
quences obtained by using the ideas in the chapter. The experiments 1 involve
automatic extraction of significant extensible patterns from some suitable col-
lection of sequences. The interested reader is directed to [ACP07] for further
details.
1 295840 [LIM]-(1,2,3,4)[STA][FY]DPC[LIM][ASG]C[ASG].H
2 2,86E+05 [LIM]-(1,2,3,4)[ASG][FY]DPC[LIM][ASG]C[ASG].H
3 155736 R-(1,4)[FY]DPC[LIM][ASG]C[ASG].H
4 78829 [LIM]-(1,2,3,4)[STA].DPC[LIM][ASG]C[ASG].H
5 76101,9 [LIM]-(1,2,3,4)[ASG].DPC[LIM][ASG]C[ASG].H
6 34205,6 [STA]-(1,4)DPC[LIM][ASG]C[ASG].H
7 30325,1 [LIM]-(1,2,3,4)[STA][FY]D.C[LIM][ASG]C..H
8 29276 [LIM]-(1,2,3,4)[ASG][FY]D.C[LIM][ASG]C..H
9 20527,3 [ASG]-(1,4)DPC[LIM][ASG]C[ASG].H
10 17503,4 [LIM]-(1,2,3,4)[ASG]..PC[LIM][ASG]C[ASG].H
FIGURE 7.2: The functionally relevant motifs are shown in bold for Nickel-
dependent hydrogenases (id PS00508). Here 22 sequences of about 23,000
bases were analyzed.
1 24,3356 TTTGCTCA
2 16,1829 AAAAATGT
3 16,1829 AACTTAAA
4 16,1829 AAATCATG
5 16,0438 TTTGCTC
6 11,9715 ATAAAAA
7 11,9715 AAAAATG
8 11,9715 ACTTAAA
7.8 Exercises
Exercise 69 (Cells) Consider the discovery method discussed in Section 7.2.
Let d be the density parameter and let Σ be the alphabet.
1. Construct an example s to show that it is possible that a cell p occurs
k times in s but a maximal pattern p′ where p is a substring of p′ may
occur more than k times in s.
2. If B is the collection of all cells in the input at the initialization phase,
show that
|B| ≤ (2 + d) |Σ|2 .
3. Prove that if the cells are processed in the order of saturation, the max-
imal motifs are emitted before their nonmaximal versions.
Hint: 1. Let d = 3, then for each pair, say, A, C ∈ Σ, the possible d + 2 cells
are:
A C, A . C, A . . C, A . . . C, A − C.
1 8469,49 G.CAAAA.CCGC.GGCGG.A.T
2 1056,48 A.CGC.GCTT.G.AC.G.AA
3 528,79 GG.A.TC.T.T.G.TA.T.GC
4 527,143 TT.GA.ATG.TTT.T.TC
5 263,566 GT.CG.T.AT.G.ATA.G
6 263,293 TT.TC.T.C.CC.AAAA
7 263,293 GAT.ATA.AA.A.AG.A
8 263,293 CA.A.TA.TCA.TT.CT
9 263,293 T.TA.G.T.TTT.CTTC
10 263,022 T.ATA.T.TATTAT.A
11 131,499 ATA.A.AA.AG.A.AA
12 131,499 T.TTT.CTT.T.CC.A
13 131,364 G.TGT.AT.AT.TAA
14 131,229 C.T.AATAA.AAAT
15 131,229 TAT.G.TAATC.CT
3. Let density constraint d = 2. The extensible pattern p and its two occur-
rences are shown below; the cell C T occurs only once in the input s.
s = A G A G C T
p = A G − C T
    A G . . C T
        A G C T
Hint: 1. If in the processing of a cell, all its locations have been used, can it
be removed from the B set?
Exercise 71 (Generalized suffix tree) How is the 'suffix' tree shown below related to the algorithm discussed in Section 7.2? Note that the labels on the edges of the tree use the '.' and '-' symbols. The density parameter is d = 2.
[Figure: four snapshots, (1)–(4), of the generalized 'suffix' tree. The edge labels use the '.' and '-' symbols (for example .c, .y, .d, ..d, −d, a, b, bydaxd$, cdabydaxd$, d$), and the leaves carry location sets such as {1}, {5}, {9}, {1,5} and {1,5,9}.]
$|L_{p_1}| = |L_{p_2}|$.
For the problems below, see Section 5.2.4 for a definition of random
strings.
Exercise 74 Let
p = G A A T T C.
1. What are the odds of seeing p at least 10 times in a 600 base long random DNA fragment?
2. What is the expected copy number of p in a random DNA fragment
of length n (i.e., how many times do you expect to see p in the DNA
fragment)?
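A quick numerical sketch of both questions, assuming the i.i.d. uniform-base model of Section 5.2.4 and approximating the n − l + 1 start positions as independent Bernoulli trials (an approximation, since occurrences can overlap):

```python
from math import comb

def prob_at_least(n, l, k, base_prob=0.25):
    # P(a fixed pattern of length l occurs >= k times in a random fragment
    # of length n), treating the n - l + 1 start positions as independent.
    trials = n - l + 1
    p = base_prob ** l
    return 1.0 - sum(comb(trials, j) * p**j * (1 - p) ** (trials - j)
                     for j in range(k))

n, l = 600, 6
print((n - l + 1) * 0.25 ** l)   # expected copy number: 595/4096, about 0.145
print(prob_at_least(n, l, 10))   # odds of >= 10 occurrences: vanishingly small
```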
Σ = {A, C, G, T}
in s. Let
prA = prC = prG = prT = 0.25.
1. What is the probability of seeing a pattern p of length l?
2. How do the odds change if p must occur at least k > 1 times?
3. What can you conclude if n ≫ l?
What are the answers to the three questions under the assumption:
{A, C, G, T}
(note each character occurs with equal probability), what is the longest stretch
of A’s you expect to see?
G A A T T C,
what is the average length of a fragment? Assume that each base occurs at a
position with equal probability and independently.
1. two A’s,
2. at least one C,
1/1024.
What is the average distance between two mutation (or restriction) sites?
What is the standard deviation?
Ω = {1, 2, 3, . . .},
$$pr(k) = (1 - p)^{k-1} p.$$
Let X ∼ Geometric(p) (the pmf above is that of the geometric distribution). Show that
$$E[X] = \frac{1}{p}, \qquad V[X] = \frac{1-p}{p^2}.$$
R(q) = {q1, q2, . . . , ql},

where l > 1. Since at any location in the input one of R(q) occurs, the probability of occurrence of q is the sum of the probabilities of occurrence of the rigid motifs qj ∈ R(q), 1 ≤ j ≤ l. Thus the probability of occurrence of q, $pr_q$, is given as
$$pr_q = \sum_{q_j \in R(q)} pr_{q_j}.$$
R(q) = {A C, A . C, A . . C, A . . . C, A . . . . C}.

Then $pr_{q_j} = pr_A \, pr_C$ for each $q_j \in R(q)$, since a dot position matches any character with probability 1. It follows that $pr_q = 5 \, pr_A \, pr_C$.
Exercise 82 Let
Σ = {A, C, F, G, L, V}
and the probability of occurrence of σ ∈ Σ is prσ with
$$\sum_{\sigma \in \Sigma} pr_\sigma = 1.$$
Let
q1 = A C . . L
$q_1 \prec q_2$ and $L_{q_1} = L_{q_2}$.

$q_1 \prec q_2$.
1. Let
s1 = A C G T A C G T C G T G T.
Enumerate the maximal solid patterns in s1 for each of the maximal
definitions.
2. Let
s2 = A C G T A C G T.
Enumerate the maximal solid patterns in s2 for each of the maximal
definitions.
3. Then is it true that the z-scores satisfy z(q1 ) ≥ z(q2 )?
4. Compare the two definitions of maximality.
Hint: Definition 1:
Definition 2:
rat brain.1
GKK.DD
1 See the cited paper for any further details on this example.
The emphasis here is on 'always'. Clearly, the fourth element violated the 'always' criterion, giving a 'not always' condition, and hence was delegated down to a dont care. 2 So, can we find a middle ground between 'always' and dont care? A probabilistic model of a motif associates a real number between zero and one (a probability) with each element (residue) that may occur in a sequence. This is defined formally below.
Consider an alphabet of size L = |Σ| as follows.
Σ = {σ1 , σ2 , . . . , σL }.
2 In fact this discontinuity gets in the way of elegant formalization under some combinatorial
models.
$$F : \mathcal{R} \to \mathbb{R},$$
where $\mathcal{R}$ is the set of all motif profiles ρ (of the same length l). In other words, for a given collection of input sequences,

log(likelihood)

is such a measure of the profile. The higher the value, usually the more (statistically) significant the motif profile.

The positions in the input data are divided into motif and nonmotif positions.
The entry in row r of column 0 is the probability of seeing σr ∈ Σ in all nonmotif positions. Thus column 0 of the matrix describes the 'background'.
For ease of exposition, the row r in the matrix ρ will be replaced by the
character it represents, i.e., σr . Also, we will switch between notation
ρij
and
ρ[i, j],
depending on the context, for ease of understanding. Note that we use column
0 in the ρ matrix to denote the nonmotif or background probabilities of each
character in the alphabet. Then, given the input and the matrix ρ for rows
1 ≤ r ≤ |Σ|
and columns
0 ≤ c ≤ l,
the log of the likelihood is given as
$$F_1 = \log\left(\prod_{i=1}^{t} \prod_{j=1}^{n_i} \rho[s_{ij}, C_{ij}]\right) = \sum_{i=1}^{t} \sum_{j=1}^{n_i} \log\left(\rho[s_{ij}, C_{ij}]\right).$$
Let f be a |Σ| × (l + 1) matrix where each entry $f_{\sigma c}$ denotes the number of positions (given by i and j) in the input with annotation c, i.e., $C_{ij} = c$, and with character σ at that position. Then
$$F_1 = \sum_{\sigma \in \Sigma} \sum_{c=0}^{l} f_{\sigma c} \log(\rho_{\sigma c}).$$
Yet another effective measure is obtained by taking a ratio with the background probabilities as follows:
$$F_2 = \sum_{\sigma \in \Sigma} \sum_{c=1}^{l} f_{\sigma c} \log \frac{\rho_{\sigma c}}{\rho_{\sigma 0}}.$$
Alignment
s1 = A A C C T A
s2 = A T G T A G G
s3 = A T A C T A
consensus   A C G

Note that this is a very small example; statistical methods work well for larger data sets. However, we use it to explain the formula. Let the probability matrix ρ be given as follows. Then, using the alignment and the matrix ρ, the frequency matrix f can be constructed as follows.
        c=0   c=1   c=2   c=3               c=0  c=1  c=2  c=3
    A   0.40  0.90  0.10  0.09          A     5    3    0    0
ρ = C   0.01  0.04  0.80  0.10     f =  C     0    0    2    1
    G   0.40  0.04  0.09  0.80          G     2    0    0    1
    T   0.19  0.02  0.01  0.01          T     3    0    1    1
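As a sanity check on the two measures, here is a minimal Python sketch that evaluates F1 and F2 on the example ρ and f above (the variable names are illustrative; the dictionaries simply transcribe the matrices):

```python
import math

alphabet = "ACGT"
# rho[sigma][c]: motif profile; column c = 0 is the background column
rho = {"A": [0.40, 0.90, 0.10, 0.09], "C": [0.01, 0.04, 0.80, 0.10],
       "G": [0.40, 0.04, 0.09, 0.80], "T": [0.19, 0.02, 0.01, 0.01]}
# f[sigma][c]: number of input positions with character sigma and annotation c
f = {"A": [5, 3, 0, 0], "C": [0, 0, 2, 1],
     "G": [2, 0, 0, 1], "T": [3, 0, 1, 1]}
l = 3  # motif length

# F1: log-likelihood, summed over all columns including the background (c = 0)
F1 = sum(f[s][c] * math.log(rho[s][c])
         for s in alphabet for c in range(l + 1))
# F2: log-odds against the background, motif columns only (c = 1..l)
F2 = sum(f[s][c] * math.log(rho[s][c] / rho[s][0])
         for s in alphabet for c in range(1, l + 1))
print(F1, F2)
```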
Method 1:                            Method 2:
1. Initialize ρ0                     1. Initialize Z0
2. Repeat                            2. Repeat
   (a) Re-estimate Z from ρ             (a) Re-estimate ρ from Z
   (b) Re-estimate ρ from Z             (b) Re-estimate Z from ρ
3. Until change in ρ is small        3. Until change in Z is small

FIGURE 8.2: Given the input sequences and the motif length, two possible learning methods to learn a motif profile.
2. estimate ρ, given z.
s1 = A G G C T T A G C T G.
However, this motif profile will never change over the iterations. The proof is straightforward and we leave it as an exercise for the reader (Exercise 87). Thus, in practice, no entry of the ρ0 matrix is set to 0.0. However, since one entry in each column is biased towards one character, the following is a good initial estimate for the example above.

       0.15  0.15  0.15    A
ρ0 =   0.55  0.15  0.15    C
       0.15  0.15  0.15    G
       0.15  0.55  0.55    T
$0 \le z_{ij} \le 1$,

where $s_{i(j+c-1)} = \sigma_c$. In other words, the sequence $s_i$ has the character $\sigma_c$ at position j + c − 1. This is a straightforward interpretation of the motif profile. For example, let the motif profile with l = 5 be given as

      0.6   0.3   0.05   0.7   0.1    A
ρ =   0.2   0.1   0.05   0.1   0.6    C
      0.1   0.5   0.1    0.1   0.1    G
      0.1   0.1   0.8    0.1   0.2    T
s1 = A G G C T T A G C T G.

Since this is a probability, we normalize it by summing over all the values in sequence $s_i$. Thus, for each i,
$$z_{ij} = \frac{z'_{ij}}{\sum_{j=1}^{n_i - l + 1} z'_{ij}}. \qquad (8.3)$$

1 ≤ i ≤ t and 1 ≤ j ≤ (n_i − l + 1), for each i.

Also, since the $z_{ij}$ are probabilities, under the assumption that a motif occurs exactly once per sequence,3 we assume the following for each i:
$$\sum_{j=1}^{n_i - l + 1} z_{ij} = 1. \qquad (8.4)$$
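A minimal Python sketch of this normalization (Equations (8.3) and (8.4)), reusing the l = 5 profile shown earlier and the sequence s1 above; the list z_raw plays the role of z′:

```python
rho = {"A": [0.6, 0.3, 0.05, 0.7, 0.1], "C": [0.2, 0.1, 0.05, 0.1, 0.6],
       "G": [0.1, 0.5, 0.1, 0.1, 0.1], "T": [0.1, 0.1, 0.8, 0.1, 0.2]}
l = 5
s1 = "AGGCTTAGCTG"  # n_1 = 11, so there are n_1 - l + 1 = 7 windows

# z'_1j: probability of the window s1[j .. j+l-1] under the profile
z_raw = []
for j in range(len(s1) - l + 1):
    prob = 1.0
    for c in range(l):
        prob *= rho[s1[j + c]][c]
    z_raw.append(prob)

# Equation (8.3): normalize so that the z_1j sum to 1 (Equation (8.4))
total = sum(z_raw)
z = [v / total for v in z_raw]
print([round(v, 4) for v in z])
```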
Further, let $\rho^{(q)}$ denote the estimate of ρ and $z^{(q)}$ denote the estimate of z after q iterations. Given $\rho^{(q)}$, the probability of sequence $s_i$, given the start position of the motif (profile), is
$$P\left(s_i \mid X_{ij} = 1, \rho^{(q)}\right) = \prod_{c=1}^{l} \rho^{(q)}_{x_c c},$$
where $s_{i(j+c-1)} = x_c$. Note that, using our notation,
$$z_{ij} = P\left(s_i \mid X_{ij} = 1, \rho^{(q)}\right).$$
where $P^0(X_{ij} = 1)$ is the prior probability that the motif starts at position j in sequence $s_i$. Since no information is available about the occurrence of the motif, $P^0$ is assumed to be uniform. Thus, for each 1 ≤ i ≤ t,
$$P^0(X_{ij} = 1) = \frac{1}{n_i - l + 1}, \quad \text{for } 1 \le j \le n_i - l + 1.$$
Then the denominator in Equation (8.5) simplifies to
$$\sum_{c=1}^{n_i - l + 1} P\left(s_i \mid X_{ic} = 1, \rho^{(q)}\right) P^0(X_{ic} = 1) = \frac{1}{n_i - l + 1} \sum_{c=1}^{n_i - l + 1} P\left(s_i \mid X_{ic} = 1, \rho^{(q)}\right),$$
and, since the numerator of Equation (8.5) carries the same constant factor $P^0(X_{ij} = 1) = \frac{1}{n_i - l + 1}$, it cancels in the ratio, leaving effectively
$$\sum_{c=1}^{n_i - l + 1} P\left(s_i \mid X_{ic} = 1, \rho^{(q)}\right).$$
Notice that Equation (8.6) has the same form as Equation (8.3) of Sec-
tion 8.6.2.
Thus we have shown that Method 1 can be viewed as an expectation max-
imization strategy.
There has been a flurry of activity around this problem [EP02, KP02b]. For instance, Improbizer [AGK+04] also uses expectation maximization to determine weight matrices of DNA motifs that occur improbably often in the input data.
s[i, j].
for each c.
K′ ≠ t.

Recall that the task is to discover or recover motif profiles, for some fixed motif length l, given t sequences. What qualifies as a motif profile?

for each j.

Next, a single character must dominate significantly at a position, say j, to specify a 'solid' character in the motif. One way of defining this is as follows: given some fixed 0 < δ < 1, if $\rho_{i'j} \ge \delta$ for some row i′, then the motif at position j takes the value σ_{i′} (the character of row i′). If this does not hold, then that position is defined to be a dont care.
AGT AC
8.9 Exercises
Exercise 85 (Statistical measures) What is the relationship between measure $F_2$ and the information content I, for an input set of sequences and a motif profile ρ of length l, shown below:
$$F_2 = \sum_{\sigma \in \Sigma} \sum_{c=1}^{l} f_{\sigma c} \log \frac{\rho_{\sigma c}}{\rho_{\sigma 0}}, \qquad I = \sum_{\sigma \in \Sigma} \sum_{c=1}^{l} \rho_{\sigma c} \log \frac{\rho_{\sigma c}}{f_\sigma}.$$
Exercise 86 (Estimating ρ) See Section 8.5 for definition of the terms used
here.
Given z, consider the following scheme for estimating ρ using sequence $s_i$. For each 1 ≤ c ≤ l, define
$$\rho_{\sigma c} = \sum_{s_i[j+c-1] = \sigma} z_{i(j+c-1)}.$$

s1  =   A    C    G    A    A    C    G    G    A    A
z1j = 0.05  0.2  0.1  0.1  0.05 0.2  0.05 0.1  0.1  0.05
ρrc = 0.0.
Then argue that at all subsequent iterations in the algorithm (both Meth-
ods 1 and 2) ρrc is likely to remain 0.0.
ρrc = 1.0.
Then argue that at all subsequent iterations in the algorithm (both Meth-
ods 1 and 2) ρrc is likely to remain 1.0.
In other words the motif is likely to have σr at position c in all iterations.
C T T.
Hint: 1. See the update procedures. 2. Note that all the other entries in
that column must be zero. 3. From 1 & 2.
Exercise 89 (Σ size) Discuss the effect of alphabet size |Σ| on the learning
algorithm.
Hint: The number of nucleotide bases is 4 and the number of amino acids is 20. So can a system that discovers transcription factors in DNA sequences also discover protein domain motifs? Why? What parameters need to change?
What about a binary sequence?
s2 = A T G T A G G
s3 = A T A C T A

This gives two possible alignments:

Alignment 1                 Alignment 2
s1  A A C C T A             s1  A A C C T A
s2  A T G T A G G           s2  A T G T A G G
s3  A T A C T A             s3  A T A C T A
motif?  A C G               motif?  A C G

If s3 also had 2 occurrences, how many alignments could there be? In the worst case, how many alignments are possible? How is multiplicity of this kind incorporated in the probability computations?
Exercise 91 (Generalizations)
1. Discuss how Expectation Maximization (Method 1) presented in this
chapter can be generalized to handle multiple occurrences in a single
sequence of the input.
2. Discuss how the Gibbs Sampling approach (Method 2) can be generalized
to handle multiple occurrences in a single sequence of the input.
3. Can the methods be extended to incorporate unsupervised learning? Why?
Hint: 1. How should z be updated? 2. What should Z0 be? How is Z0 updated? 3. One of the major difficulties is in guessing in which of the sequences the motif is absent while estimating ρ or Z.
No.  pos     Prediction             M  I
 0   −101    T G A C G T C A        1
 1   −299    T G C − G T C A        1
 2   −71     T G A C A T C A        1  1
 3   −69     A T G A − G T C A G    2
 4   −527    T G C G A T G A        2  1
 6   −173    T G A − C T A A        2
 7   −1595   T G A − A T G A        2
 8   −221    T G G − G T C T        2
 9   −69     T G A − C T G C        3
10   −105    T G A − A T C A        1
12   −780    T G C − G T C A        1
14   −1654   A T G A − A T C A      1  1
15   −69     A T G A − G T C A A    2
16   −97     T G A − G T A A        1
17   −1936   A T G A − A T C A      1  1
signal       T G A   G T C A

FIGURE 9.1: An example of a subtle motif (signal) as a transcription factor binding site in human DNA. Notice that at each occurrence the motif is some 'edit distance' away from the consensus signal. The edit operations are mutation (M) and insertion (I); see Section 9.3 for details on edit distance.
{A, C, G, T}

and the problem is made difficult by the fact that each occurrence of the pattern p may differ in some d positions, and the consensus pattern p may not occur with d = 0 in any of the sequences.
This is a rather difficult criterion to meet, since the learning algorithms use some form of local search based on Gibbs sampling or expectation maximization. Hence it is not surprising that these methods may miss p.

However, a question of this form is a biological reality. Consider the following, somewhat contrived, variation of Problem 9, which is an attempt at simplifying the computational problem.

Pevzner and Sze [PS00] made the question more precise and provided a benchmark for the methods by fixing the following parameters: the signal has length l = 15, each occurrence differs from it in exactly d = 4 positions, and it is planted in each of t = 20 sequences, each of length n = 600. A solution to this apparently simplified problem proved so difficult that it was dubbed the challenge problem.

This chapter discusses methods that solve problems of this flavor. This formalization, in a sense, is the combinatorial version of the problem discussed in Chapter 8. A further generalization, along with a method to tackle it, is presented in the concluding section of the chapter.
We first clarify the different ‘motifs’ used in this chapter. The central goal
is to detect the consensus or the embedded or the planted motif in the given
data sets which is also sometimes referred to as the signal in the data or the
subtle signal. When a motif is not qualified with these terms, it refers to a
substring that appears in multiple sequences, with possible wild cards.
Hamming(p1, p2) = 2,
since
p1[1] = p2[1], p1[3] = p2[3], and p1[4] = p2[4],
but
p1[2] ≠ p2[2] and p1[5] ≠ p2[5],
marked as ‘X’ in the alignment.
Edit distance. Given a pattern (or string) p1 various edit operations can
be performed on p1 to produce p2 . Here we describe three edit operations
$\sigma \, (\ne p_1[j]) \in \Sigma$.
p2 = A G C or A T C or A A C.
p2 = A A C C or A C C C or A G C C or A T C C.
2 It is possible to have edit operations defined on a segment of the string instead of a single location j. For example, an inversion on positions 2–4 can transform p1 = G A C T C to p2 = G T C A C.
Let the occurrence of a motif in the input be o. For a given length l and d
(< l), a motif p is a subtle motif if at each occurrence, o, in the input
dis(p, o) ≤ d.
For example let l = 5 and d = 2 and the three occurrences in the sequences
are shown below.
s1 = T A T C C T
s2 = A C T C A C
s3 = C T C C A A

dis(p, o1) = 2 ≤ d,
dis(p, o2) = 1 ≤ d,
dis(p, o3) = 2 ≤ d.
Thus one must look at all the occurrences to infer what the motif must be.
K′ ≥ K
For simplicity, the sequences are of the same length l, all the t sequences are aligned, and we will further assume that a pattern occurs at most once in each sequence.
Given a motif, let the embedded signal in each sequence be constructed
with some d edit operations. Given one of these edit operations, we assume
qM + qX + qI = 1.
The model. We consider the following simplified model. Given a fixed pattern
(or signal),
psignal ,
Assume that d out of the l positions are picked at random on the embedded
motif for exactly one of the edit operations, insertion, deletion or mutation.
l can be viewed as the size of the motif. Recall that we assume that the
sequences are correctly aligned. Then if a position in the aligned sequence is a mismatch, it is due to either a mutation or an insertion, and so the probability of a mismatch at a position is
$$\frac{d}{l}(q_M + q_I).$$
Next, the probability q of a position being a solid character in a motif is:
$$q = 1 - \frac{d}{l}(q_M + q_I). \qquad (9.1)$$
q for three scenarios is shown below.
                                              qM    qX    qI    q
1) Exactly d mutations                         1     0     0    1 − d/l
2) Exactly d edits                            1/3   1/3   1/3   1 − 2d/(3l)
3) Exactly d edits with equiprobable          1/2   1/4   1/4   1 − 3d/(4l)
   indel and mutation
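As a quick arithmetic check of the table, substituting the rates of scenario 2 into Equation (9.1) gives
$$q = 1 - \frac{d}{l}\left(\frac{1}{3} + \frac{1}{3}\right) = 1 - \frac{2d}{3l},$$
and scenario 3 gives $q = 1 - \frac{d}{l}\left(\frac{1}{2} + \frac{1}{4}\right) = 1 - \frac{3d}{4l}$; note that the deletion rate $q_X$ does not enter Equation (9.1).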
When no more than d′ edit operations are carried out on the embedded motif, it is usually interpreted as each of the edit counts 0, 1, 2, . . . , d′ being equally likely, so one uses the expected value
$$d = \frac{d'}{2}$$
in Equation (9.1).
Alignment              Pattern
A C G − T c C          A C G − T c C  ←
A − G − T A C          A − G − T A C  ←
A C G a T A C          A C G a T A C  ←
A C c − T A C          A C c − T A C  ←
g C G − T A C          g C G − T A C
A T C k
For a pattern p with some H solid characters, let p occur in some k sequences
(and not in the remaining (t − k) sequences). Then
FIGURE 9.2: For t = 20, l = 20, the expected number of maximal motifs E[Z_{K,q}] is plotted against (a) quorum K shown along the X-axis (for different values of q which are close to 1.0), and (b) against q shown along the X-axis (for different values of quorum K).
FIGURE 9.3: For t = 20, l = 20, the expected number of maximal motifs
E[ZK,q ], is plotted against quorum K shown along the X-axis, for different
values of q, in a logarithmic scale. Unlike the plot in Figure 9.2, the value of q
here varies from 0.25 to 1.0. Notice that when q = 1, the curve is a horizontal
line at y = 1. Note that for DNA sequences, q = 0.25 corresponds to the
random input case.
If Ep denotes the event that p occurs in some fixed k sequences then for any
two distinct events, i.e.,
p1 6= p2 ,
Ep1 and Ep2 are not necessarily mutually exclusive. However, if the pattern
is maximal, i.e., H is the maximum number of solid characters seen in the
k sequences, then for a fixed set of k sequences, there is at most one maxi-
mal pattern that occurs in these k sequences and not in the remaining t − k
sequences. Further, when the pattern is maximal there is a guarantee of mis-
match in the remaining (l − H) positions in all the k rows and the probability
of this mismatch is given as
$$(1 - q^k)^{l-H}. \qquad (9.3)$$
Thus if
Pmaximal (K, H, q)
is the probability that some maximal pattern with H solid characters and quorum K occurs in the input data, then, using Equation (9.4),
$$P_{maximal}(K, H, q) = \sum_{k=K}^{t} \binom{t}{k} \left(1 - q^H\right)^{t-k} \left(q^H\right)^{k} \left(1 - q^k\right)^{l-H}. \qquad (9.5)$$
Let
ZK,q
be a random variable denoting the number of maximal motifs with quorum
K and q as defined above, and,
E[ZK,q ]
denotes the expectation of ZK,q . Using linearity of expectations (for a fixed t
and l),
$$E[Z_{K,q}] = \sum_{h=1}^{l} \binom{l}{h} P_{maximal}(K, h, q) = \sum_{h=1}^{l} \binom{l}{h} \left( \sum_{k=K}^{t} \binom{t}{k} \left(1 - q^h\right)^{t-k} \left(q^h\right)^{k} \left(1 - q^k\right)^{l-h} \right).$$
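A small Python sketch of Equation (9.5) and the expectation above (the function names are illustrative); with t = l = 20 it reproduces the kind of curves plotted in Figures 9.2 and 9.3:

```python
from math import comb

def p_maximal(K, h, q, t, l):
    # Equation (9.5): probability that some maximal pattern with h solid
    # characters occurs with quorum K
    return sum(comb(t, k) * (1 - q**h) ** (t - k) * (q**h) ** k
               * (1 - q**k) ** (l - h) for k in range(K, t + 1))

def expected_maximal_motifs(K, q, t=20, l=20):
    # E[Z_{K,q}] by linearity of expectation over the C(l, h) choices
    # of h solid positions
    return sum(comb(l, h) * p_maximal(K, h, q, t, l) for h in range(1, l + 1))

for K in (5, 10, 16, 20):  # q = 0.75 is the challenge-problem regime (Section 9.9)
    print(K, expected_maximal_motifs(K, 0.75))
```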
$|C_{signal}| = |\Sigma|^l$.
Ci = {p | p is a substring of length l in si }.
This can be obtained by a single scan of si from left to right, and at each
location j, extracting a pattern p as
p = s[j . . . (j + l − 1)].
Step 2 (Computing C′1 , C′2 , . . . , C′t ). For each p ∈ Ci construct the ‘neigh-
borhood’ patterns as follows:
{p1 , p2 , . . . , pr } ⊂ Ci ,
which is the set of r patterns such that p′ is at distance d from each of these
patterns.
What is the size of each $C'_i$? The number of positions that are mutated in the pattern p is d; thus the number of distinct patterns with mutations in these positions is no more than
$$\binom{l}{d}.$$
Further, if the original value at one of the positions is σ, then it can take any value from the set Σ \ {σ}. Thus the total number of distinct patterns at distance d from a pattern is no more than
$$\binom{l}{d} (|\Sigma| - 1)^d. \qquad (9.8)$$
Using Equations (9.6) and (9.8),
$$|C'_i| \;\le\; |C_i| \binom{l}{d} (|\Sigma| - 1)^d \;\le\; (|s_i| - l + 1) \binom{l}{d} (|\Sigma| - 1)^d \;=\; O\!\left(|s_i| \binom{l}{d} |\Sigma|^d\right).$$
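A minimal Python sketch of the neighborhood construction (the function name nebor follows the Nebor notation of the example below; the implementation is an assumption consistent with the bound (9.8)):

```python
from itertools import combinations, product

def nebor(p, d, alphabet="ACGT"):
    # All patterns at Hamming distance exactly d from p; their number is
    # at most C(l, d) * (|Sigma| - 1)^d, matching Equation (9.8).
    out = set()
    for positions in combinations(range(len(p)), d):
        choices = [[c for c in alphabet if c != p[i]] for i in positions]
        for subst in product(*choices):
            q = list(p)
            for i, c in zip(positions, subst):
                q[i] = c
            out.add("".join(q))
    return out

print(sorted(nebor("TATC", 1, alphabet="ATC")))  # 8 patterns: C(4,1) * 2
```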
For each
p′′ ∈ Csignal ,
j1 , j2 , . . . , jt .
s1 = T A T C C,    Σ = {A, T, C},
s2 = A C T C A,    t = 3,
s3 = C T T T C,    l = 4 and d = 1.

C1 = {TATC, ATCC},
C2 = {ACTC, CTCA},
C3 = {CTTT, TTTC}.
Since
TCTC ∈ Nebor(TATC, 1), TCTC ∈ Nebor(ACTC, 1), TCTC ∈ Nebor(TTTC, 1), and
ATTC ∈ Nebor(ATCC, 1), ATTC ∈ Nebor(ACTC, 1), ATTC ∈ Nebor(TTTC, 1),

Using TCTC:                    Using ATTC:
s1 = T A T C C                 s1 = T A T C C
s2 = A C T C A                 s2 = A C T C A
s3 = C T T T C                 s3 = C T T T C
consensus  T C T C             consensus  A T T C
T C T C and A T T C,
mask1 = 1 1 . .
mask2 = 1 . 1 .
mask3 = 1 . . 1
mask4 = . 1 1 .
mask5 = . 1 . 1
mask6 = . . 1 1

Each mask is applied to every l-window of a sequence of length n, i.e., n − l + 1 times. For example, the 2-mers picked for mask1 and mask2 are shown below.
s1      T A T C C          s1      T A T C C
mask1   1 1 . .            mask2   1 . 1 .
2-mers  TA; AT             2-mers  TT; AC

The complete list of 2-mers picked by the masks is listed below.

        s1: TATCC       s2: ACTCA       s3: CTTTC
mask1   TA..; AT..      AC..; CT..      CT..; TT..
mask2   T.T.; A.C.      A.T.; C.C.      C.T.; T.T.
mask3   T..C; A..C      A..C; C..A      C..T; T..C
mask4   .AT.; .TC.      .CT.; .TC.      .TT.; .TT.
mask5   .A.C; .T.C      .C.C; .T.A      .T.T; .T.C
mask6   ..TC; ..CC      ..TC; ..CA      ..TT; ..TC
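The masks are just the $\binom{l}{k}$ ways of keeping k of the l positions; a minimal Python sketch of the projection step (the names are illustrative):

```python
from itertools import combinations

def projections(s, l, k):
    # Slide every mask of k kept positions over every l-window of s;
    # for l = 4, k = 2 this generates exactly the six masks above.
    for mask in combinations(range(l), k):
        for j in range(len(s) - l + 1):
            window = s[j:j + l]
            yield mask, "".join(window[i] for i in mask)

for mask, kmer in projections("TATCC", 4, 2):
    print(mask, kmer)  # e.g., mask (0, 1) picks TA and AT, as in the table
```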
Step 3. The local alignments suggested by some of the masks are shown below.

s1 = T A T C C               support
s2 = A C T C A               s1   s2   s3
s3 = C T T T C       ..TC     1   +1   +1   (IV)
                     C.C.     0   +2   +0   (V)

Consensus alignment of the three sequences gives the signal as shown below.
Step 2.

        s1: TATCC        s2: ACTCA        s3: CTTTC
mask1   TAT.; ATC.       ACT.; CTC.       CTT.; TTT.
mask2   TA.C; AT.C       AC.C; CT.A       CT.T; TT.C
mask3   T.TC; A.CC       A.TC; C.CA       C.TT; T.TC
mask4   .ATC; .TCC       .CTC; .TCA       .TTT; .TTC
Step 3.

        support              l-mers & alignment
        s1   s2   s3         s1 = T A T C C
T.TC     1   +0   +1         s3 = C T T T C

Notice that this does not extract the l length signal. It just extracts the following:

T.TC

This example illustrates the fact that this enumeration is inexact, since one of the solutions is missed in Case 2. It is also quite possible that a solution is missed for all possible values of k.
V = V1 ∪ V2 ∪ . . . ∪ Vt ,
where for i 6= j, 1 ≤ i, j ≤ t,
Vi ∩ Vj = ∅,
and for each pair $v_{i_1}, v_{i_2} \in V_i$, $(v_{i_1}, v_{i_2}) \notin E$ holds.
In other words, the vertex set can be partitioned into t (nonintersecting) sets
such that the edges go across the sets but not within each set.
Given a graph G(V, E), a subgraph G(V′ ⊂ V, E′ ⊂ E) is a clique if for every pair $v_1, v_2 \in V'$, $(v_1, v_2) \in E'$.

But is this not mapping one difficult problem onto another difficult problem? The clique finding problem is well studied and various heuristics have been designed to effectively solve the problem, and we wish to exploit these insights to solve our problem at hand. However, to avoid digression, we do not discuss the details of solving the clique problem here.
[Figure: a tripartite graph with vertices TATC, ACTC, TTTC in the top row and ATCC, CTCA, CTTT in the bottom row; the edges are labeled with Hamming distance 2.]

FIGURE 9.4: The tripartite graph with each partition (of vertices) arranged along a column. The two cliques are the top row and the bottom row of vertices, respectively.
It is easy to see that G(V, E) is a t-partite graph. Next the task is to find all
cliques of size t in the graph.
Each such clique gives an alignment of the input sequences and a consensus
motif p. This p is checked to see if the problem constraints are satisfied.
Example (2). The 3-partite graph is constructed as follows (see Figure 9.4).
The vertex set is
V = V1 ∪ V2 ∪ V3 ,
where each Vi is defined as follows.
1. V1 = {v11 , v12 },
where T AT C is mapped to v11 and AT CC is mapped to v12 .
2. V2 = {v21 , v22 },
where ACT C is mapped to v21 and CT CA is mapped to v22 .
3. V3 = {v31 , v32 },
where CT T T is mapped to v31 and T T T C is mapped to v32 .
The following upper diagonal matrix shows the Hamming distance between
two l-mers mapped to the two vertices. Each nonzero distance gives an edge
in the graph.
T C T C and A T T C,
Note that distinct condensed submotifs could give rise to the same k-mers.
For example,
O(1) time to access the entry in the table. Thus the tables shown in Step 3 of the enumeration scheme of Section 9.6.2 can be efficiently constructed, or filled in. Note that in this case a condensed submotif is a key to the hash table, which results in a considerable reduction in the size of the hash table. Also, at each iteration a distinct value of k is used and the same hash table is fortified with more entries. Thus each iteration strengthens the table.
At the end of this process, each entry in the table that shows significant
support is picked up for further scrutiny and the hidden signal is extracted
from the local alignment suggested by the support. In fact this step uses
the learning algorithms discussed in Chapter 8. The reader is directed to the
paper by Buhler and Tompa [BT02] for further details of this algorithm.
2. sequence alignment.
By delegating the task to these subproblems, the method can also handle deletions and insertions (called indels) in the embedded signal.
This delineation into two steps also helps address the more realistic version
of the problem that includes insertion and deletion in the consensus motif
(Problem 11). The main focus of this method is in obtaining good quality PS
segments and restricting the number of such segments to keep the problem
tractable.
$E[Z_{K,q}]$

with

$E[Z_{K,\frac{1}{4}}]$,

the expectation for the random case. See Figure 9.3 for the plots of $\log(E[Z_{K,q}])$. Consider

q = 0.75;

this is the approximate value of q for the challenge problem of Section 9.1. In Figure 9.3, this is shown by the red curve and, for large K, say K ≥ 16,
9.10 Conclusion
We have discussed several strategies for tackling the problem of finding subtle signals across sequences. This continues to be an active area of research, with very close interaction between biologists, computer scientists and mathematicians.
4 Varun [ACP05] is available at: www.research.ibm.com/computationalgenomics.
9.11 Exercises
Exercise 93 (Distance) Let Σ = {0, 1}. A pattern p of size l is defined on
Σ.
where the edit operations allowed are (a) mutation, (b) insertion and (c)
deletion.
Hint: Design an ‘enumeration tree’ particularly for 1(ii) and 2(ii) to avoid
multiple enumerations of the same patterns.
If
$$d = Hamming(p, p_1) = Hamming(p, p_2),$$
then
$$2d \ge Hamming(p_1, p_2).$$
Comments
The topic of this chapter exemplifies the difficulties with biological reality. Elegant combinatorics and practical statistical principles, along with biological wisdom, may sometimes be required to answer innocent-looking questions.
Part III
Patterns on Meta-Data
Chapter 10
Permutation Patterns
10.1 Introduction
In this chapter we deal with a different kind of motif or pattern: one that is defined merely by its composition and not by the order in which its elements appear in the data. For example, consider two chromosomes in different organisms. We study gene orders in a section of the chromosomes of the two organisms as shown below:

s1 = . . . g1 g2 g3 g4 g5 g6 g7 . . .
s2 = . . . g1′ g5′ g2′ g4′ g3′ g6′ g7′ . . .

Genes gi (in s1) and gi′ (in s2) are assumed to be orthologous genes. Clearly,
it is of interest to note that the block of genes g2 , g3 , g4 , g5 appear together,
albeit in a different order in each of the chromosomes. This collection of genes
is often called a gene cluster. The size of the cluster is the number of elements
in it and in this example the size is 4. Such clusters or sets of objects are
termed permutation patterns.1 They are called so because any one of the
patterns can be numbered 1 to L where L is the size of the pattern and every
other occurrence is a permutation of the L integers. For example, in s1 the pattern can be numbered as 1 2 3 4, and its occurrence in s2 then reads 4 1 3 2.
1 This cluster or set is also called a Parikh vector (Section 10.4) or a compomer (Exercise 118).
10.1.1 Notation
Recall from Chapter 2 (Section 2.8.2) that Π(s) denotes the set of all char-
acters occurring in a sequence s. For example, if
s = a b c d a,
then
Π(s) = {a, b, c, d}.
However s may have characters that appear multiple times (also referred to as
the copy number). Then we use a new notation Π′ (s). In this notation, each
character is annotated with the number of times it appears. For example,
s = a b b c c b d a c b,
Π(s) = {a, b, c, d},
Π′ (s) = {a(2), b(4), c(3), d}.
Thus element a has copy number 2, b has copy number 4 and so on. Note that
d appears only once and the copy number annotation is omitted altogether.
Given an input string s on a finite alphabet Σ, a permutation pattern (or
πpattern) is a set p ⊆ Σ. p occurs at location i on s if
p = Π (s [i, i + 1, . . . , i + L-1]) ,
s = aacbbb xx abcbab,

with the two occurrences

o1 = a a c b b b,
o2 = a b c b a b.

With copy numbers, p = Π′(o1) = Π′(o2) = {a(2), b(3), c}. Otherwise, p = {a, b, c}.
The size of the pattern p is written as |p|. In the first case |p| = 6 and in the
second case |p| = 3. Note that in both cases, the length at each occurrence of
the pattern p must be the same.
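Both notations are easy to compute; a minimal Python sketch (the function names are illustrative):

```python
from collections import Counter

def Pi(s):
    # Pi(s): the set of characters occurring in s
    return set(s)

def Pi_prime(s):
    # Pi'(s): each character annotated with its copy number
    return dict(Counter(s))

s = "abbccbdacb"
print(Pi(s))        # {'a', 'b', 'c', 'd'}
print(Pi_prime(s))  # {'a': 2, 'b': 4, 'c': 3, 'd': 1}
```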
s = a b c d e a b c d e a b c d e,
10.3 Maximality
In an attempt to reduce the number of permutation patterns in an input
string s, without any loss of information, we use the following definition of a
maximal pattern [LPW05].
Let P be the set of all permutation patterns on a given input string s.
(p1 ∈ P ) is nonmaximal with respect to (p2 ∈ P ) if both of the following hold.
(1) Each occurrence of p1 on s is covered by an occurrence of p2 on s. In
other words, each occurrence of p1 is a substring in an occurrence of p2 .
(2) Each occurrence of p2 on s covers l ≥ 1 occurrence(s) of p1 on s.
A pattern (p2 ∈ P ) is maximal, if there exists no (p1 ∈ P ) such that p2 is
nonmaximal w.r.t. p1 .
It is straightforward to verify the following and we leave the proof as an
exercise for the reader (Exercise 99). Note that this directly follows from the
framework presented in Chapter 4.
LEMMA 10.1
(Maximal lemma) If p2 is nonmaximal with respect to p1 , then p1 ⊂ p2 .
$|L_{p_1}| = |L_{p_2}|$?

The sizes of the location lists must be the same when each element of the patterns p1 and p2 has copy number 1. See Exercise 100 for the possible relationship between $|L_{p_1}|$ and $|L_{p_2}|$ when the copy number of some elements is > 1.
However, to show that maximality as defined here is valid, it is important
to show the uniqueness of the set of maximal permutation patterns. Again,
this also follows from the framework presented in Chapter 4.
THEOREM 10.1
(Unique maximal theorem) Let M be the set of all maximal permutation
patterns, i.e.,
M = {p ∈ P | there is no (p′ ∈ P) such that p is nonmaximal w.r.t. p′}.
Then M is unique.
o1 , o2 , . . . , oK
Thus the data structure (called a PQ Tree) used in the solution to the GCA
problem can be used as the representation to capture M (p). See Chapter 13
for an exposition on this.
Consider a pattern p and its collection of nonmaximal patterns M (p) given
in Figure 10.2. The PQ tree representation of the maximal pattern is shown in Figure 10.2. The root node represents the maximal permutation pattern
given as set (10.1), the Q node represents the nonmaximal patterns given as
set (10.2) and the internal P node represents the nonmaximal pattern given
as set (10.3).
Using the symbol ‘-’ to denote immediate neighbors, since the PQ tree is a
hierarchy, it can also be written linearly as:
[Figure: a PQ tree with leaves a, b, c, d, e, f, g.]

FIGURE 10.2: The PQ tree notation of the maximal pattern p.
o1 = d e a b c x c,
o2 = c d e a b x c,
o3 = c x c b a e d.
Assume that none of the elements of p appear elsewhere in the input. What
are the nonmaximal patterns?
Recall that the leaves of a PQ tree are labeled bijectively by the elements
of p. Since p has at least one element σ with copy number c > 1, then the
tree must have c leaves labeled by σ. Assuming we can abuse a PQ structure
thus, can a PQ tree represent all the nonmaximal patterns?
Can we simply rename the two c’s as c1 and c2 ? We can fix this in o1 , but
which c is c1 and which one is c2 in o2 and in o3 ? We must take all possible
renaming into account.
The nonmaximal patterns are shown as nested boxes. The following trees
capture the nonmaximal patterns: T1,3 represents the first and third cases,
T2 and T4 represent the second and fourth cases respectively.
[Figure: three PQ trees over the leaves d, e, a, b, x and two leaves labeled c. T1,3 represents the first and third cases, T2 the second case, and T4 the fourth case.]
|Σ| = O(1),
then in O(1) time each new or old pattern can be accounted for (using an
appropriate hash function), giving an overall O(n) time algorithm, for a fixed
L. However, this assumption may not be realistic and in general
|Σ| = O(n).
Then the approach needs some more care for efficiency and the discussion
here using Parikh Mapping is adapted from an algorithm given by Amir et
al [AALS03]. The reader is directed to [Did03, SS04, ELP03] for a discussion
on other approaches to this problem.
However, the discovered patterns in the algorithm are not in the maximal
notation. We postpone this discussion to Section 10.5 where the Intervals
Problem is presented. We take all the occurrences,
o1, o2, . . . , ok, of the pattern in the input sequences s1, s2, . . . , sk,
and the output is the maximal notation of p in terms of a PQ tree. Note that
if p has multiplicities, then the Intervals Problem is invoked multiple times:
see Section 10.3.2 for a detailed discussion on this.
Ψ[1 . . . |Σ|],
where Ψ[q] keeps count of the number of appearances of letter q in the current
window. Hence, the sum of the values of the elements of Ψ is L. In each
iteration the window shifts one letter to the right, and at most 2 variables of
Ψ are changed:
(1) one variable is increased by one (adding the rightmost letter) and
(2) one variable is decreased by one (deleting the leftmost letter of the
previous window).
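A minimal sketch of this sliding-window maintenance of Ψ in Python (using a dictionary rather than a fixed array, as an implementation convenience):

```python
from collections import Counter

def window_parikh(s, L):
    # Maintain the Parikh vector of a window of size L: per shift, exactly
    # one entry is incremented (new rightmost letter) and one decremented
    # (old leftmost letter), so at most 2 entries of Psi change.
    psi = Counter(s[:L])
    yield dict(psi)
    for i in range(L, len(s)):
        psi[s[i]] += 1
        psi[s[i - L]] -= 1
        if psi[s[i - L]] == 0:
            del psi[s[i - L]]
        yield dict(psi)

for v in window_parikh("abcdeabcde", 4):
    print(v)
```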
2 More than 40 years after the appearance of this seminal paper, during an informal conversation, Rohit Parikh, a logician at heart, told me that he had done this work on formal languages for money! He explained that as a graduate student he was compelled to take a summer job that produced this work.
Note that |Σ| = 6 and the Parikh Mapping array Ψ is padded with the • character so as to make its length a power of 2. This complete example is described in Figure 10.3.
LEMMA 10.2
The maximum number of distinct tags generated by the algorithm's tagging scheme, using a window of size L on a text of length n, is O(|Σ| + n log |Σ|).

PROOF The initialization of the binary tagging tree contributes O(|Σ|) tags, and the number of iterations is n − L + 1. At each iteration, at most log |Σ| changes are made due to the addition of a new character on the right and at most log |Σ| changes due to the removal of an old character on the left. Thus at each iteration j no more than 2 log |Σ| new tags are generated in the binary tagging tree of Ψj. Thus the number of distinct tags is
$$t = O(|\Sigma| + n \log |\Sigma|).$$
To give a subarray at level > 1 a tag, we only need to know whether the pair of tags of the composing subarrays has appeared previously. If it has, then the array gets the tag of this pair; otherwise, it gets a new tag. Assume that the first elements of the tag pairs are stored in a balanced tree T1. Further, the pairs are gathered, and yet another balanced tree $T_v^2$ is stored at each node v of T1. Thus it takes
$$O((\log t)^2)$$
time to access a tag pair, where both T1 and $T_v^2$ are binary searched and t is the number of distinct tags.
To summarize, it takes O(|Σ|) time to initialize the binary tagging tree of the Parikh Mapping array Ψ. The number of iterations is O(n), and at each iteration O(log |Σ|) changes are made, each of which takes O((log t)²) time, for a fixed L; this gives O(|Σ| + n log |Σ| (log t)²) time overall. If L*, the size of the largest pattern on s, is not known, then this algorithm is iterated O(n) times.
10.5 Intervals
The last section gives an algorithm to discover all permutation patterns in
a given string s. We now take a look at a relatively simple scenario: Given K
sequences where n characters appear exactly once in each sequence and each
sequence is of length n, the task is to discover common permutation patterns
that occur in all the sequences.
In other words, s1 can be viewed as the sequence of integers 1, 2, 3, . . . , n
and each of
s2 , s3 , . . . , sK
is a permutation of n integers. See Figure 10.4 for an illustrative example.
Why is this problem scenario any simpler? For input sequences s2, s3, . . . , sK, the encoding to integers allows us to simply study the integers and deduce whether they are potential permutation patterns or not. For example, a subsequence of the form

4 6

can never contribute to a common permutation pattern of size 2 since, in s1, the two are not immediate neighbors. By the same argument, the subsequence

6 4 5
3 Heber and Stoye call the difference between the maximum and minimum values the interval
defect.
a d c b e  ⇒  1 2 3 4 5
c e d a b  ⇒  3 5 2 1 4
d a b e c  ⇒  2 1 4 5 3
b c e a d  ⇒  4 3 5 1 2
s = 1 2 3 4 . . . n.

Clearly, each of the O(n²) segments of s is an interval here, so a string may have as many as O(n²) intervals.
LEMMA 10.3
Let s be a sequence of n integers where each number appears exactly once.
Then for all 1 ≤ i < j ≤ n, the following statements hold.
1. f(i, j) ≥ 0. In other words, it cannot take negative values.
2. If f (i, j) = 0, then [i, j] is an interval.
Lines (2), (4), (5), (6), (7) take O(1) time each. Lines (4), (5), (6), (7) are executed
$$1 + 2 + \ldots + (n-2) + (n-1) = \frac{n(n-1)}{2}$$
times. Thus the entire algorithm takes O(n²) time.
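The O(n²) check is short enough to write out; a minimal Python sketch using the characterization of Lemma (10.3), with running maxima and minima so each f(i, j) costs O(1):

```python
def intervals(s):
    # [i, j] is an interval iff f(i, j) = max - min - (j - i) = 0 (Lemma 10.3)
    n, out = len(s), []
    for i in range(n - 1):
        u = l = s[i]  # running max and min of s[i..j]
        for j in range(i + 1, n):
            u, l = max(u, s[j]), min(l, s[j])
            if u - l - (j - i) == 0:
                out.append((i + 1, j + 1))  # report 1-based, as in the text
    return out

print(intervals([1, 4, 6, 5, 8, 7, 3, 9, 2]))  # includes (2,4), (5,6), (1,9), ...
```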
Notice that the number of intervals in s could be O(n2 ). Thus an algorithm
that outputs all the intervals must do at least O(n2 ) work. But what if s has
only O(n) intervals, can we do better?
An algorithm whose time complexity is a function of the output size is
called an output sensitive algorithm. Let NO be the number of intervals in a
string s of length n = NI . We next describe an output sensitive algorithm
that takes time
O(NO + NI ).
The main idea is to cut down the number of candidates for the checking at lines (6) and (7). Hence, the authors Uno and Yagiura call this the Reduce Candidate (RC) algorithm. We give the pseudocode of the algorithm as Algorithm (8).
(1) FOR i = n − 1 DOWNTO 1
(2)     InsertLList(i, i + 1, s[i], L)
(3)     InsertUList(i, i + 1, s[i], U)
(4)     ScanpList(i, U, L)
(5) ENDFOR
We describe the algorithm and its various aspects in the following five parts.
We conclude with a concrete example.
1. Potent indices. We first identify certain indices j, called potent.4 For a fixed i, for some i < jp ≤ n, let

But
u(2, 7) = 7 and l(2, 7) = 1.
But j = 7 > j2; hence j1 = 6 is not potent w.r.t. i = 2.
4 Uno and Yagiura in their paper use unnecessary j’s, which in a sense is complementary to
the idea of potent j. I define potent j’s for a possible simpler exposition.
LEMMA 10.4
If [i, j] is an interval, then j must be potent w.r.t. i.
Then clearly
l(i, j) ≤ s[j ′ ] ≤ u(i, j)
which leads to a contradiction. Hence the assumption must be wrong.
Thus, in conclusion, only the potent j's are sufficient to extract all the intervals in s. In the algorithm, the p-list is the list of potent j's (in increasing value of the index).
We begin by studying some key properties of u(i, j), l(i, j) and f (i, j) func-
tions.
LEMMA 10.5
(Monotone functions lemma) Let i ≥ 1 be fixed and for i < j1 < j2 ≤ n,
the following hold.
• (U.1) u(·, ·) is a nondecreasing function of j, i.e., u(i, j1) ≤ u(i, j2).
• (L.1) l(·, ·) is a nonincreasing function of j, i.e., l(i, j1) ≥ l(i, j2).
LEMMA 10.6
• (F.2) Let 1 ≤ i1 < i2 < j1 < j2 ≤ n. Further, let the following hold.
u(i1 , j1 ) = u(i2 , j1 ) and l(i1 , j1 ) = l(i2 , j1 ) and
u(i1 , j2 ) = u(i2 , j2 ) and l(i1 , j2 ) = l(i2 , j2 ).
Then
f (i1 , j1 ) − f (i1 , j2 ) = f (i2 , j1 ) − f (i2 , j2 ).
Figure 10.5 illustrates the facts of the lemma for a simple example. Notice that, for a fixed i (= 2), the function u(i, j) is nondecreasing and l(i, j) is nonincreasing as j increases. The p-list of potent j's is shown at the bottom. We explain a few facts here.
1. j = 3 is potent since it is the largest j with u(i, j) = 6 and l(i, j) = 5.
(a) [Plot: for s = 4 6 5 8 7 3 9 2 and i = 2, the nondecreasing function u(i, j) is drawn above the string, the nonincreasing function l(i, j) below it, with the potent indices marked along the bottom.]

(b)
j        3  4  5  6  7  8
s[j]     5  8  7  3  9  2
u(i, j)  6  8  8  8  9  9
l(i, j)  5  5  5  3  3  2
R(i, j)  1  3  3  5  6  7
r(i, j)  1  2  3  4  5  6
f(i, j)  0  1  0  1  1  1
FIGURE 10.5: Illustration of Lemmas (10.5) and (10.6). (a) The input
string s is shown in the center. The figure shows the snapshot of the u(i, j)
and l(i, j) functions when index i = 2 pointing to 6 in s. As j goes from
3 to 8: (1) u(i, j) is the nondecreasing function shown on top (U.1 of the
lemma), (2) l(i, j) is the nonincreasing function shown at the bottom (L.1 of
the lemma), and (3) each of five potent indices (j = 3, 5, 6, 7, 8) of the p-list
are shown as little hollow circles in the bottom row. Only two of the potent
j’s, j = 3, 5 evaluate f (i, j) to 0. These are shown as dark circles. (b) The
tabulated values of the functions. The potent j’s are marked by arrows.
2. List of u(·, ·), l(·) functions. Consider the task of constructing the list
of u(·, ·) and l(·) functions:
For i = (n − 1), down to 1,
construct u(i, j) and l(i, j), for i < j ≤ n.
At iteration i, the function u(i, j) and l(i, j) is evaluated (or constructed
for the algorithm). At i, a straightforward (say like that of the algorithm of
Section 10.5.1) process scans the string from n down to i, taking O(i) time to
compute u(·, ·) and l(·, ·). Since there are n − 1 iterations and
(n − 1)n
1 + 2 + 3 + . . . + (n − 1) = ,
2
this task takes O(n2 ) time for all the n − 1 iterations.
The RC algorithm performs the above task in only O(n) time. This is done
by a clever update at each iteration in the following manner.
1. The u(·, ·) and l(·, ·) functions are stored as lists with the ability to add and remove at one end of the list, called the head of the list. This is also called the Last In First Out (LIFO) order of accessing elements in a list. The algorithm maintains a U list to store values of u(·, ·) and an L list to store l(·, ·). However, only distinct elements are stored, along with the largest index j that attains the value. Thus, if
$$u(i, j-1) < u(i, j) = u(i, j+1) = \ldots = u(i, j+l) < u(i, j+l+1)$$
for some l, then s[j + l] is stored (along with the index (j + l)) in the list. By the same reasoning, s[j − 1] is stored (along with the index (j − 1)) and is the head of the list if i = j − 2.
For example, consider the following segment of s and let i = 2.

j       2  3  4  5  6  7
s[j]    4  3  7  6  1  5
U list: 7 → 6 → 5

Note that U has only three elements. The head of the list points to element 7 (with index j = 4).

j       2  3  4  5  6  7
s[j]    4  3  7  6  1  5
L list: 1 → 5

Note that L has only two elements. The head of the list points to element 1 (with index j = 6).
2. At each iteration, an element may be added to the list (U or L or both), and zero, one or more consecutive elements may be removed, in order, from the head of the list.
This follows from Lemmas (10.7) and (10.8): the first deals with the U
list and the second is an identical statement for the L list. The following
can be verified and we leave the proof of these lemmas as an exercise
for the reader.
LEMMA 10.7
For a fixed i, consider the two functions
LEMMA 10.8
For a fixed i, consider the two functions
3. p-list of potent indices. Note that the lists U and L are already sorted by j. Merging the two lists gives the p-list, or the list of potent j's. For example,

j       2  3  4  5  6  7
s[j]    4  3  7  6  1  5
U list: 7 → 6 → 5
L list: 1 → 5
p-list: j = 4 → 5 → 6 → 7

p has four elements, with the head pointing to index j = 4.
By Lemma (10.6), there is no interval of the form [i′ , j1 ] if
Hence, p-list can be pruned by removing j1 from the head of the list. We
make the following claim:
A p-list that is pruned only at the head of the list, possibly multiple
times, is such that for any two consecutive indices, j1 and j2 , in the
pruned list,
f (i, j1 ) ≤ f (i, j2 ).
This observation is crucial in asserting both the correctness and in the justi-
fication of the output-sensitive time complexity of the algorithm.
O(n)
O(NO )
O(NI + NO ).
(n − 1, n, s[n]).
The initialization is shown in Figure 10.6(1). To avoid clutter, only the value
u(i, j) and l(i, j) is shown in the U and L lists respectively, and i and j are
shown separately to track the iterations.
The algorithm loops through lines (1) through (5). At each iteration or
position i, the algorithm maintains the upper bound information u(i, j) and
the lower bound information l(i, j) for each i < j ≤ n in the two lists U and L
respectively. The two lists store only distinct elements as shown in the figure.
Recall from Lemma (10.5) that for a fixed i, as j goes from (i + 1) to n,
Permutation Patterns 291
[Figure 10.6: a trace of the RC algorithm on s = 1 4 6 5 8 7 3 9 2, panels (1)–(9), one per iteration i = 8 down to i = 1. Each panel shows the U list, the L list, the p-list and the f(i, j) values; the intervals emitted along the way include [5,6] at i = 5, [3,4] at i = 3, [2,4], [2,6], [2,7], [2,8] and [2,9] at i = 2, and [1,9] at i = 1.]
Thus, by the monotonicity of the functions u(·, ·) and l(·, ·), a new element is only ever added at the head of the list, so this operation takes O(1) time. Thus, if the new element s[i − 1] is added to the list, it can only become the head of the list.

The list of potent j's, the p-list in Figure 10.6, can be computed from the U list and the L list by traversing the two lists from the head and using the pair (i, j′) where j′ is the largest j such that
For example, consider Figure 10.6(6). Here i = 3 and the four potent j’s
are:
[Figure: the p-list for i = 1 on a string of length 14; f(i, j) = 0 at j = 3, 5, 7, 9, 10, 11, 13 and f(i, j) > 0 at j = 14.]
FIGURE 10.8: Here i = 1 and the three irreducible intervals are shown
by arrows at j = 3, j = 9 and j = 13 representing intervals [1, 3], [1, 9] and
[1, 13] respectively. Intervals [1, 5] [1, 7], [1, 10] and [1, 11] are not irreducible,
however f (1, j) = 0, for j = 5, 7, 10, 11. While scanning p list for irreducible
intervals of the form [1, ·], the j’s for which f (i, j) is actually evaluated are
j = 3, 9, 13, 14. The scanning terminates when f (i, j) > 0 (here at j = 14).
LEMMA 10.9
Let
1 ≤ i < j1 < j < j2 ≤ n.
Then if [i, j1] and [j1, j2] are intervals, [i, j] is not an irreducible interval.
PROOF We first show that [j1, j] is an interval: the proof of this statement is not very difficult and is left as an exercise for the reader (Exercise 115). Next, the interval [i, j] cannot be irreducible, since the two intervals [i, j1] and [j1, j] overlap and their union is [i, j].
2. The scanning of the p-list now jumps to the element following $j_{max}^{3} = 7$, which in this example is 9. f(1, 9) evaluates to 0. Again $j_{min}^{9} = 10$ and $j_{max}^{9} = 11$, which had been computed before.

3. The scanning of the p-list now jumps to the element following $j_{max}^{9} = 11$, which here is 13. f(1, 13) evaluates to 0, but there are no intervals of the form [13, ·].

4. So the scanning continues to the next element on the list, 14. f(1, 14) evaluates to a nonzero value and the scanning stops.

Next, $j_{min}^{1}$ is updated to 3 and $j_{max}^{1}$ is updated to 13, for subsequent iterations.
FOR i = n − 1 DOWNTO 1 DO
    InsertLList(i, i+1, s[i], LHd)
    InsertUList(i, i+1, s[i], UHd)
⇒   ScanpListirreducible(i, UHd, LHd)
⇒   Update $j_{min}^{i}$, $j_{max}^{i}$
ENDFOR
To summarize, the RC intervals algorithm can be modified to compute the
irreducible intervals and this is shown as Algorithm (10). The lines marked
with right arrows on the left are the new statements introduced here.
The last paragraph summarized the ScanpListirreducible(·) routine, and the
workings of the other routines are straightforward and are left as an exercise
for the reader.
A complete example of computing irreducible intervals on s = 3 2 4 6 5 7 8 1 9
is shown in Figure 10.9.
Correctness of algorithm (10). The algorithm emits the same intervals
as the RC algorithm except the ones suppressed by the scan jumps. This is
straightforward to see from Lemma (10.9).
We first establish a connection between irreducible intervals and PQ trees.
We postpone the analysis of the complexity of the algorithm to after this
discussion.
[Figure 10.9: a trace of the irreducible-intervals algorithm on s = 3 2 4 6 5 7 8 1 9, panels (1)–(6), with the $j_{min}^{i}$ and $j_{max}^{i}$ updates shown per panel. The irreducible intervals emitted are [6,7] at i = 6, [4,5] and [4,7] at i = 4, [3,5] and [3,7] at i = 3, and [1,2], [1,3], [1,8] and [1,9] at i = 1.]
v1 v2 . . . vl.

Then for 1 ≤ j1 ≤ j2 ≤ l,

Let V1 be the set of P nodes and V2 the set of Q nodes in T. Then the set of intervals encoded by this PQ tree T is:
$$I(T) = \left(\bigcup_{v \in V_1} \{I(v)\}\right) \cup \left(\bigcup_{v \in V_2} I(v)\right). \qquad (10.5)$$
3. (disjoint) k11 < k12 < k21 < k22 , without loss of generality.
j1 , j2 , . . . , jl−1 , jl ,
s = 8 9 1 4 6 3 5 2 7.
j = 6 down to j = 4
Permutation Patterns 301
have no pointers, so s[6], s[5], s[4] are collected as children. These 4 children are assembled together as a P node as shown in (1).

j = 7 maintains a unidirectional pointer, u-ptr, to this P node and j = 4 maintains a bidirectional pointer, l-ptr. Both are shown as dashed curves in the figure. In other words, the interval spanned by the P node is captured through the u-ptr and the l-ptr. The u-ptr of j = 7 and the l-ptr of j = 4 are updated to point to the constructed P node as shown in (1).
2. Next consider the irreducible interval [4, 8] (Figure 10.11(2)). j = 8 has no pointers, so s[8] is collected as a child node. But j = 7 has a u-ptr pointing to the P node, which points to j = 4 via the bidirectional l-ptr. Hence the P node and s[8] are collected as children. As there are only two elements, a Q node is constructed with these two as children. This is shown in (2). The u-ptr of j = 8 and the l-ptr of j = 4 are updated to point to the constructed Q node as shown in (2).

3. Similarly, the irreducible interval [4, 9] is processed, as shown in Figure 10.11(3).

Next, at i = 1, two irreducible intervals [1, 2] and [1, 9] are emitted. They are also considered in increasing order of their sizes.

1. First [1, 2] is processed. j = 2 has no pointers and clearly j = 1 has no pointers either, so s[2] and s[1] are collected as children. Since there are only two children, a Q node is constructed with these two as children, as shown in (4). The u-ptr of j = 2 and the l-ptr of j = 1 are updated to point to the freshly constructed Q node.

2. Next [1, 9] is processed. j = 9 has a u-ptr to a Q node that points to j = 4 via the l-ptr. So the Q node is assembled as a child. The next cell considered is j = 3 (to the immediate left of the l-ptr of the Q node). This has no pointers, so s[3] is assembled as a child. Next j = 2 (immediate left of j = 3) is considered. This has a u-ptr pointing to a Q node whose l-ptr points to 1. Thus this Q node is assembled as a child and the scanning stops. Since there are three children, a P node is constructed with these three children, as shown in (5).

This completes the example. Figure 10.13 describes another example. Here we illustrate a case where j = 2 at Figure 10.13(4) has an l-ptr (but no u-ptr). In this case s[2] is collected as a sibling, not a child, as shown in Figure 10.13(5).
LEMMA 10.10
At every iteration, j has no more than 1 pointer. The pointer is either a u-ptr
or a l-ptr.
PROOF Assume cell j has two u-ptrs, then there are two irreducible
intervals of the form [·, j]. Clearly one is contained in the other, hence must be
a child (or descendent) of the other. By the step shown as a boxed statement
in the pseudocode of Algorithm (11), the pointers of the child (or descendent)
have been removed, leading to a contradiction. Similarly, j cannot have multiple l-ptrs.

Next assume that cell j has a u-ptr and an l-ptr. Then they must have a parent Q node and, by the boxed statement of the algorithm, the children's pointers are removed, leading to a contradiction.
[Figure 10.11: construction of the PQ tree for s = 8 9 1 4 6 3 5 2 7, panels (1)–(6): (1) the P node for [4, 7]; (2) the Q node for [4, 8]; (3) the Q node for [4, 9], processed at i = 4; (4) the Q node for [1, 2]; (5) the P node for [1, 9]; (6) the completed tree.]
[Figure 10.13: construction of the PQ tree for s = 7 8 9 3 5 2 4 0 6 1, panels (1)–(5); in panel (5) the interval [1, 10] is processed at i = 1 and s[2], which carries an l-ptr but no u-ptr, is collected as a sibling rather than a child.]
THEOREM 10.2
(Irreducible intervals theorem) Consider s, a permutation of integers
1, 2, . . . , n. Let I be the set of all intervals on s and let M be the set of all
irreducible intervals on s. Then the following statements hold.
I = B(M).

3. The size of M is bounded by n, i.e., |M| < n.
s = 1 2 3 4 . . . n.
Then
M = {{1, 2}, {2, 3}, . . . , {n − 1, n}}.
Thus |M | = (n − 1) and the bound is tight.
A maximal permutation pattern is relevant in the context of multiply ap-
pearing characters or patterns that appear only in a subset (not necessarily
all) of the collection of sequences. Now it is easy to see that both the algo-
rithms take O(n) time. It is clear that Algorithm (10) is linear in the size of
the output. Since the number of irreducible intervals is no more than n, the
algorithm takes O(n) time.
It is easy to see in Algorithm (11) that each cell is scanned once. The
number of internal nodes is bounded by n. Thus the algorithm takes O(n)
time.
10.7 Applications
Genes that appear together consistently across genomes are believed to be
functionally related: these genes in each others’ neighborhood often code for
proteins that interact with one another suggesting a common functional asso-
ciation. However, the order of the genes in the chromosomes may not be the
same. In other words, a group of genes appear in different permutations in
the genomes [MPN+ 99, OFD+ 99, SLBH00]. For example in plants, the ma-
jority of snoRNA genes are organized in polycistrons and transcribed as poly-
cistronic precursor snoRNAs [BCL+01]. Also, the olfactory receptor (OR) gene superfamily is the largest in the mammalian genome. Several of the human OR genes appear in clusters with ten or more members located on almost all human chromosomes, and some chromosomes contain more than one cluster [GBM+01].
As the available number of complete genome sequences of organisms grows,
it becomes a fertile ground for investigation along the direction of detecting
gene clusters by comparative analysis of the genomes. A gene g is compared
with its orthologs g ′ in the different organism genomes. Even phylogenetically
close species are not immune from gene shuffling, such as in Haemophilus influenzae and Escherichia coli [WMIG97, SMA+97]. Also, a multicistronic
gene cluster sometimes results from horizontal transfer between species [LR96]
and multiple genes in a bacterial operon fuse into a single gene encoding multi-
domain protein in eukaryotic genomes [MPN+ 99].
1, 2, 3, . . ., 25422,
1, 2, 3, . . ., 25422
obtained from the SLAM output table. The full mapping can be found in:
http://crilx2.hevra.haifa.ac.il/∼orenw/MappingTable.ps.
Ignoring the trivial permutation pattern involving all the genes, there are
only 504 interesting maximal ones out of 1,574,312 permutation patterns in
this data set. In Figure 10.15 a subtree of the Human-Rat whole genome PQ
[Figure: the subtree's leaves are the gene identifiers 1997, 1998, 2017, 2018, 2019, 2025, 2026, 2027, 2040, 2041, 2042, 2043, 2044, 2045, 2122, 2123, 2124, 2125.]

FIGURE 10.15: A subtree of the common maximal permutation pattern PQ tree of human and rat orthologous genes.
(1997 − 2125)
(2043 − 2041, 2025 − 2018, 2123 − 2125, 2122 − 2044, 2040 − 2026, 2017 − 1997).
(1) 66-gene cluster.
Human chromosome 1:   A B C D E F G H I J
Rat chromosome 13:    J I H G D B F E C A
Blocks: A 1988−2013, B 2014−2021, C 2022−2036, D 2037−2039, E 2040−2118, F 2119−2121, G 2122−2128, H 2129−2130, I 2131−2141, J 2142−2153.

(2) 47-gene cluster.
Human chromosome 9:   A 55 56 57 58 59 60 61 62 C
Rat chromosome 5:     A 57 59 55 60 56 62 58 61 C
Blocks: A 12745−12754, B 12755−12762, C 12763−12791.

(3) 31-gene cluster.
Human chromosome 10:  A B C D E F
Rat chromosome 17:    E C A F B D
Blocks: A 13544−13553, B 13554−13556, C 13557−13562, D 13563, E 13564−13573, F 13574.

FIGURE 10.16: Examples of common gene clusters of human and rat. See text for details.
10.8 Conclusion
Although permutation patterns have been studied more recently than substring patterns, their usefulness should not be underestimated. The notion of maximality in this new context is particularly interesting, since it provides a purely combinatorial way of cutting down on the output size without compromising any information content. We end the chapter by reiterating the dramatic reduction in the output size simply by the use of maximality on two biological data sets.

Number of all patterns    Number of maximal patterns
10.9 Exercises
Exercise 99 (Maximality) Prove that if p2 is nonmaximal with respect to
p1 , then p1 ⊂ p2 . Is the converse true? Why?
Exercise 101 Consider the permutation patterns shown in Figure 10.1. Which
of these are maximal? Give the PQ tree representation of the maximal pat-
terns.
Hint: p = a-b-c-d-e occurs at locations 1, 6 and 11 on the input s. Can every
other pattern be deduced from p?
Hint: Does the following PQ tree capture all the nonmaximal patterns of p
given by Equation (10.4) in Section 10.3.2?
(A PQ tree with root children x and c, where the subtree x spans d, e, a, b.)
Note that only one leaf node is labeled with c, although the multiplicity of c
is 2 in p. In the following example, a single PQ tree cannot represent the two
nonmaximal patterns {a, b, c, d} and {c, d, e, f } in the two occurrences:
o1 = c b d a g e f d c ,
o2 = d g c a b c d e f .
o1 = d e a b c x c,
o2 = c d e a b x c,
o3 = c x c b a e d.
Exercise 104 Can the running time of the Parikh mapping-based algorithm
discussed in this chapter be improved?
Hint: Is it possible to reduce the factor (log t)2 to log t in the time
complexity? Recall that the tags are assigned in increasing order. Let tj be
the largest integer assigned to a tag at stage j. If a newly encountered tag
(t′1 , t′2 ) is such that t′1 , t′2 ≤ tj , then the first of the tag pair can be
stored in an array and directly accessed in O(1) time, reducing one of the
O(log t) factors to O(1).
This can be made possible if all the entries in the Ψ array are known in
advance, which can be done by a linear scan of the input with the L-sized
window, recording all the distinct numbers in Ψ that are generated by the
window. Let the largest number encountered be t∗0 ; note that t∗0 ≤ L by the
choice of the window size. The tag values are then assigned starting with t∗0 ,
so that every number tnew previously recorded in Ψ satisfies tnew ≤ t∗0 .
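To make the hint concrete, here is a minimal Python sketch (all names are ours,
not the book's): it scans the input with an L-sized window, maintains the
window's Parikh vector incrementally, and assigns integer tags on first sight,
so that a later pass can retrieve them by a direct lookup rather than a
balanced-tree search.

```python
from collections import Counter

def window_parikh_vectors(s, L):
    """Parikh vector of every L-sized window of s; each step updates
    only two counts, so the whole scan is linear in len(s)."""
    counts = Counter(s[:L])
    vectors = [frozenset(counts.items())]  # hashable snapshot of the window
    for i in range(L, len(s)):
        counts[s[i]] += 1
        counts[s[i - L]] -= 1
        if counts[s[i - L]] == 0:
            del counts[s[i - L]]
        vectors.append(frozenset(counts.items()))
    return vectors

def tag_windows(s, L):
    """Assign tags in increasing order of first occurrence; the dict plays
    the role of the directly indexed array in the hint."""
    tags = {}
    for v in window_parikh_vectors(s, L):
        tags.setdefault(v, len(tags) + 1)
    return tags

print(len(tag_windows("aabcab", 3)))  # 2 distinct window Parikh vectors
```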
p = {a, b, c}
o = a b b c a.
2. Given s and a fixed i, show that if j(> i) is not potent with respect to
i, then [i′ , j] is not an interval for all i′ ≤ i.
Hint: (1) Are all elements of s distinct? (2) See the proof of Lemma (10.4).
Exercise 109 Give arguments to show that statement (F.2) of Lemma (10.6)
is equivalent to the following statement: for

1 < i < j1 < j2 ≤ n,

f (i − 1, j1 ) − f (i − 1, j2 ) = f (i, j1 ) − f (i, j2 ).

Prove the above statement or (F.2).
j        2 3 4 5 6 7 8
s[j]     6 5 8 7 3 9 2
u(i, j)  6 6 8 8 8 9 9
l(i, j)  4 4 4 4 3 3 2
R(i, j)  2 2 4 4 5 6 7
r(i, j)  1 2 3 4 5 6 7
f (i, j) 1 0 1 0 0 0 0
(a) s = 4 6 5 8 7 3 9 2. (b) i = 1.
Exercise 112 Give a pseudocode description, along the lines of the subrou-
tines in Algorithm (9), of the following three routines:
3. ScanpList(i, j, v, Hd).
Exercise 113 Let s of length n be such that each element is distinct and
Π(s) ⊂ {1, 2, . . . , N }
and 1, N ∈ Π(s), for some N > n. Does Algorithm (8) work for this input s?
1. For the upper-diagonal matrices below, does the Monge property hold
(whenever the matrix elements are defined)?

(Three upper-triangular matrices, (a) Mu , (b) Ml and (c) Mf , are shown in
the original; row i of Mu holds u(i, j), of Ml holds l(i, j) and of Mf holds
f (i, j), for j ≥ i, for some sequence s.)
2. For
1 ≤ i′ < i < j < j ′ ≤ n
show that
f (i′ , j) + f (i, j ′ ) ≥ f (i, j) + f (i′ , j ′ ).
Hint: 1. A row i in Mu is u(i, j), in Ml is l(i, j), and in Mf is f (i, j), for
j ≥ i for some sequence s.
2. Show that (i) I = B(M ), (ii) M is unique, and (iii) |M | < n.
(Trees not reproduced.) (a) T = T ′ . (b) T . (c) T ′ .
s2 = 3 5 2 4 7 6 8 1 s3 = 3 5 2 4 6 7 8 1
Exercise 117 Consider Algorithm (10). If the scanning of the input is switched
to left-to-right (instead of right-to-left as in the current description), does the
algorithm emit the same irreducible intervals? Why?
Similarly, one or more of A, G and T can be cleaved, giving rise to more
fragments.
Assume an assay DNA technology (MALDI-TOF mass spectrometry [Bö04])
that reads only Π′ (f ) (also called a compomer) for each fragment f . In this
example, the complete collection of compomers is as follows:
1. (cleaved by C): {0, A}, {C, G, T (2), 1}, {0, A, C}, {G, T (2), 1},
2. (cleaved by A): {C(2), G, T (2), 1},
3. (cleaved by G): {0, A, C(2)}, {T (2), 1},
4. (cleaved by T ): {0, A, C(2), G}, {T, 1}, {0, A, C(2), T, G}, {1}.
Is it possible to reconstruct the original s from this collection of compomers?
1. The task is to formulate the problem that can address the above.
2. Consider the following two examples. What are the common permuta-
tions (clusters) in both the graphs in (a) and (b)?
(Figure: the two example graph pairs, (a) and (b), on the vertex sets A–J.)
Given graphs G(V, Ei ), 1 ≤ i ≤ n, a quorum K > 1 and a size m, the problem
is to find all the maximal V ′ ⊆ V with |V ′ | ≥ m,
Although partitive families can be quite large (even exponentially large), they
have a compact, recursive representation in the form of a tree, where the leaves
are the singletons of V , namely, the decomposition tree:
THEOREM 10.3
(Decomposition theorem) [MCM81] There are exactly three classes of
internal nodes in a decomposition tree of a partitive family.
a. A Prime node is such that none of its children belongs to the family,
except for the node itself.
b. A Degenerate node is such that every union of its children belongs to
the family.
c. A Linear node is given with an ordering on its children such that a union
of children belongs to the family if and only if they are consecutive in
the ordering.
(Figure: two trees, (a) and (b), over the vertices A–J.)
Comments
I particularly like this chapter since it is a nice example of the marriage
of elegant theory and useful practice. The idea of maximality of patterns is
usually very important, and in the case of permutation patterns it is also
nonobvious. Further, it fits beautifully with the PQ tree data structure and
Parikh mapping, both of which have been well studied independently in the
literature.
Chapter 11
Permutation Pattern Probabilities
11.1 Introduction
Just as it is reasonable to compute the odds of seeing a string pattern in
a random sequence, so it is with permutation patterns. We categorize
permutation patterns as (1) unstructured and (2) structured.
The former usually refers to the case where these patterns (or clusters) are
observed in sequences, usually defined on fairly large alphabet sets.
The structured permutations refer to PQ trees, that is, the encapsulation of
the common internal structure across all the occurrences of the permutation
pattern. The question here concerns the odds of seeing this structure (as
a PQ tree) in a random sequence.
Consider an n-mer with i1 number of A's, i2 number of C's, i3 number of G's
and i4 number of T's, where i1 + i2 + i3 + i4 = n. This is a pattern p where
the order does not matter (also called a permutation pattern in Chapter 10)
and is written as ωi1 ,i2 ,i3 ,i4 . Let ωi1 ,i2 ,i3 ,i4 be an n-mer with
i1 number of A’s,
i2 number of C’s,
i3 number of G’s and
i4 number of T’s
and let pX be the probability of occurrence of X where
X = A, C, G or T
with
pA + pC + pG + pT = 1.
Then the probability measure function

M_P : Ω → ℝ≥0

is defined as follows:

M_P(\omega_{i_1,i_2,i_3,i_4}) = \frac{(i_1+i_2+i_3+i_4)!}{i_1!\, i_2!\, i_3!\, i_4!}\,(p_A)^{i_1}(p_C)^{i_2}(p_G)^{i_3}(p_T)^{i_4}
                              = \frac{n!}{i_1!\, i_2!\, i_3!\, i_4!}\,(p_A)^{i_1}(p_C)^{i_2}(p_G)^{i_3}(p_T)^{i_4}.

In particular, if the four nucleotides A, C, G, T are equiprobable, then the
formula simplifies to

M_P(\omega_{i_1,i_2,i_3,i_4}) = \frac{n!}{i_1!\, i_2!\, i_3!\, i_4!} \cdot \frac{1}{4^n}.
How do we get this formula? And, does it satisfy the probability mass condi-
tions (see Section 3.2.4)?
To address these curiosities, we pose the following general question where
we use m instead of 4.
What is the number of distinct strings where each has exactly i1 number
of x1 ’s, i2 number of x2 ’s, . . ., im number of xm ’s?
This is not a very difficult computation, but we also need to show its relation
to a probability mass function. Hence we take a ‘multinomial’ view of the
problem: It turns out that this number is precisely the multinomial coefficient
in combinatorics. This is one of the easiest ways of computing this number
and we study that in the next section. The summary of the discussion is as
follows.
1. MP (ωi1 ,i2 ,i3 ,i4 ) is computed from the multinomial coefficient (divided
by mn ), and
2. P (Ω) = 1 follows from Equation (11.3).
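As a quick numerical companion to the formula above, the following sketch
(plain Python; the function name is ours) evaluates M_P directly from the
multinomial coefficient:

```python
from math import factorial

def multinomial_probability(counts, probs):
    """M_P: the multinomial coefficient times the product of per-symbol
    probabilities raised to their counts."""
    n = sum(counts)
    coeff = factorial(n)
    for i in counts:
        coeff //= factorial(i)
    prob = 1.0
    for i, p in zip(counts, probs):
        prob *= p ** i
    return coeff * prob

# Equiprobable A, C, G, T: a 4-mer containing one of each nucleotide.
print(multinomial_probability([1, 1, 1, 1], [0.25] * 4))  # 24/256 = 0.09375
```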
If
i1 + i2 + . . . + im = n,
then each string is of length n. As an example, let m = 2, n = 3 and i1 = 1
(hence i2 = 2). The distinct strings are:

ω1 = x1 x2 x2 ,
ω2 = x2 x1 x2 and
ω3 = x2 x2 x1 .
Let Sig(m, n) be the set of all possible signatures for the given m and n. For
instance,
Sig(2, 2) = {[2, 0], [1, 1], [0, 2]}.
2 This is also called the Parikh vector and is discussed in Chapter 10.
I : Ω → Sig(m, n),

where I(ω) is the signature of ω. For m > 0 and n ≥ 0, the following can be
verified (with some patience):

(x_1 + x_2 + \cdots + x_m)^n = \sum_{\Psi \in Sig(m,n)} \frac{n!}{i_1!\, i_2! \cdots i_m!}\, x_1^{i_1} x_2^{i_2} \cdots x_m^{i_m}.    (11.1)
The number

\frac{n!}{i_1!\, i_2! \cdots i_m!},    (11.2)

the multinomial coefficient M C(m, n), counts these strings (try, for
instance, m = 2 and n = 3, or m = 3 and n = 2). Thus, setting each x_j = 1
in Equation (11.1),

\sum_{\Psi \in Sig(m,n)} \frac{M C(m, n)}{m^n} = 1.    (11.3)
Further, let

k = i_1 + i_2 + \cdots + i_l ≤ n.

For each distinct occurrence, the probability of its occurrence is given as:

(p_{\sigma_1})^{i_1} (p_{\sigma_2})^{i_2} \cdots (p_{\sigma_l})^{i_l} \bigl((1 - p_{\sigma_1})(1 - p_{\sigma_2}) \cdots (1 - p_{\sigma_l})\bigr)^{n-k}.    (11.5)

i.e., the events are disjoint for any pair j_1 ≠ j_2 . The proof of this
statement is left as Exercise 121 for the reader.
Next, using Equations (11.4) and (11.5), the answer to the first problem,
denoted as P_{i_1+i_2+\cdots+i_l}, is given as

P_{i_1+i_2+\cdots+i_l} = \binom{n}{i_1+i_2+\cdots+i_l} \frac{(i_1+i_2+\cdots+i_l)!}{i_1!\, i_2! \cdots i_l!}\, (p_{\sigma_1})^{i_1} \cdots (p_{\sigma_l})^{i_l} \bigl((1 - p_{\sigma_1}) \cdots (1 - p_{\sigma_l})\bigr)^{n-k}.
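The sketch below transcribes the displayed expression literally, including
the product form of the residual term exactly as printed above; the function
name is ours:

```python
from math import comb, factorial

def p_exactly(n, counts, probs):
    """P_{i1+...+il}: choose the k positions, count the arrangements of
    the sigma symbols within them, and weight by the displayed
    probability terms."""
    k = sum(counts)                      # k = i1 + ... + il
    arrangements = factorial(k)
    for i in counts:
        arrangements //= factorial(i)
    p = comb(n, k) * arrangements
    for i, q in zip(counts, probs):
        p *= q ** i
    residual = 1.0
    for q in probs:
        residual *= 1 - q
    return p * residual ** (n - k)
```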
For

j = (i′_1 , i′_2 , . . . , i′_l ) ∈ C,

let E_j denote the event that σ_1 , σ_2 , . . . , σ_l occur exactly
i′_1 , i′_2 , . . . , i′_l times respectively. Then

E_{j_1} ∩ E_{j_2} = ∅,    (11.7)

i.e., the events are disjoint for any pair j_1 ≠ j_2 (∈ C). We leave the
proof of this as an exercise for the reader (Exercise 121).
Since the events are disjoint, the answer to the second problem, denoted as
P′_{i_1+i_2+\cdots+i_l}, is obtained using the solution to Problem 1:

P′_{i_1+i_2+\cdots+i_l} = \sum_{(i′_1, i′_2, \ldots, i′_l) \in C} P_{i′_1+i′_2+\cdots+i′_l}.    (11.8)
The reader is also directed to [DS03, HSD05] for results on real data and
generalizations to gapped permutation patterns.
Given a permutation pattern

q = {σ1 , σ2 , . . . , σl }

that occurs k times in the input, what is the p-value, pr(T, k), of its
maximal form given as a PQ tree T ?
What does it mean to compute this probability? We give an exposition
based on explicit counting below.
11.3.1 P -arrangement
An arrangement is a permutation of the integers i, i + 1, i + 2, . . . , i + k − 1,
and its inversion is obtained by reading the elements from right to left. For
example, q1 and q2 shown below are arrangements of sizes 5 and 3 respectively.
Note that
[1..k]
is always an interval, hence is called the trivial interval. Every other interval
is nontrivial. See the examples below for illustration.
q1 = 5 2 4 3 1:
  interval [k1 ..k2 ]   Π(q1 [k1 ..k2 ])   size
  [3..4]                {3, 4}            2
  [2..4]                {2, 3, 4}         3
  [1..4]                {2, 3, 4, 5}      4

q2 = 4 5 6:
  interval [k1 ..k2 ]   Π(q2 [k1 ..k2 ])   size
  [1..2]                {4, 5}            2
  [2..3]                {5, 6}            2
Base cases. Note that for k = 1 the problem is not defined, since the size
of an interval is at least two. For k = 2, the P -arrangements are 1 2 and
2 1. For k = 3 there are none: each of 1 32, 21 3 and 3 21 contains a
nontrivial interval (marked by the adjacent pair).
For k = 4, q4 = 3 1 4 2 is a P -arrangement. Adding element 5 at the front
gives 5 3142, in which 3142 is a nontrivial interval; similarly, adding
element 5 at the other end gives a nontrivial interval. Also, if element 5
is inserted next to 4, as in 3 1 54 2, then 54 is a nontrivial interval.
However, q5 = 3 5 1 4 2 is a P -arrangement.
LEMMA 11.1
Let q be a P -arrangement of
1, 2, . . . , k.
Let q′ be constructed from q by inserting element k + 1 at any of the
k − 3 positions in q that is not an end position and not adjacent to element
k. Then q′ is a P -arrangement of 1, 2, . . . , k, k + 1.
Does the converse of Lemma (11.1) hold? In other words, is it true that
removing element k + 1 from any P -arrangement of
1, 2, . . . , k, k + 1
4 21 3
It turns out that the smallest nontrivial interval is nested in the others,
i.e., this interval is a subset of the others. An interval [i11 . . . i12 ] is
a subset of [i21 . . . i22 ], written as

[i11 . . . i12 ] ⊂ [i21 . . . i22 ],

if and only if the following holds:
In fact, in this example, all the intervals are nested, since otherwise the ar-
rangement of size 5 will not be a P -arrangement. This observation can be
generalized as the following lemma.
LEMMA 11.2
Let q be a P -arrangement of
1, 2, . . . , k, k + 1.
q = 9 1 3 6 4 φ 7 5 8 2 10
(the original figure repeats q with each of its nested intervals highlighted;
φ is the empty symbol).
The sizes of the intervals in q are: 4, 5, 6, 7, 8, 9, 10, and there are two
intervals of size 5. We observe the following about nested intervals in an
arrangement.
LEMMA 11.3
Let q be an arrangement such that it has r nested nontrivial intervals

[i^1_1 ..i^1_2 ] ⊂ [i^2_1 ..i^2_2 ] ⊂ · · · ⊂ [i^r_1 ..i^r_2 ].

Then for each j = 1, 2, . . . , r,

q[i^j_1 ..i^j_2 ]

is a P -arrangement of size i^j_2 − i^j_1 + 1.
This gives a handle on designing a method for counting (as well as enumer-
ating) arrangements with nested nontrivial intervals. Consider the following
arrangement of size 10 with nested intervals as shown:
8 10 6 3 1 4 2 7 5 9
Note that the smallest interval is a P -arrangement, and when the smallest
interval is replaced by its extreme (largest here) element, shown in bold below,
8 10 6 4 7 5 9
31 4 65 2 → 31 45 2 → 3142

9 1 3 4-7 8 2 10 → 9 1 3-7 8 2 10 → 9 1 3-8 2 10 → 9 1 2-8 10 → 9 1-8 10 → 1-9 10 → 1-10
9 1 3 4-7 8 2 10 → 9 1 3 4-8 2 10 → 9 1 3-8 2 10 → 9 1 2-8 10 → 9 1-8 10 → 1-9 10 → 1-10
THEOREM 11.1
(P -arrangement theorem) Let q be a P -arrangement of size k + 1. Let
q ′ be obtained by replacing an extreme element (either k + 1 or 1) from its
position j in q, with the empty symbol. Then only one of the following holds.
i_1 < j < i_2 .

Let i_j = i^j_2 − i^j_1 + 1. Then

i_1 < i_2 < . . . < i_{r−1} < i_r .

The telescoping sizes, x_j (1 ≤ j ≤ r), are defined as:

x_1 = i_1 ,
x_j = i_j − i_{j−1} + 1, for j > 1.
2. S(u, l): Let S(u, l), u ≥ l, denote the number of arrangements of size u
that has only nested intervals and the size of the smallest interval is l.
3. N st(k), Nst’(k):
(a) Let N st(k) denote the number of arrangements of size k that have
only nested intervals. Then N st(k) can be defined in terms of S(·, ·)
as follows:
N st(k) = \sum_{l=2}^{k} S(k, l).    (11.9)
Consider N st′ (k) of Equation (11.10). Note that the smallest interval of
size l may contain the element k; placing the element k + 1 in this interval
then gives a nontrivial interval, resulting in an overestimation of the
number of P -arrangements. Thus, using Theorem (11.1), we get the following:
Estimating S(u, l). Now we are ready to define S(u, l), u ≥ l > 1, in terms
of P a(u′ ) where u′ < u.
First, consider the case when there is exactly one nontrivial interval. The
single nontrivial interval is of size l. Then we can consider a P -arrangement
of size u − l + 1 and each position in this arrangement can be replaced by yet
another P -arrangement of the remaining l elements giving the following:
S(l, l) = P a(l)
3. each successive interval differs from the next in size by at least two, and

Note that the telescoping sizes (see Section 11.3.2) are as follows:

x_1 = i_1 ,
x_{k_1} = i_r − i_{k_1} + 1,
x_{k_2} = i_r − i_{k_2} + 1.

Let the number of such arrangements be N ; our task is to compute N . Note
that we know neither the exact number of nested intervals nor the size of
each interval. But this does not matter.
P -arrangement of size > 2. Note that a P -arrangement of size l > 2 is such
that the largest element is never at the ends of the arrangement. Thus this
arrangement (or its inversion) can be inserted within another P -arrangement
without producing intervals that are nested.
First, we compute N1 , the number of arrangements with the size of the
largest nontrivial interval as i_{k_1}. Using the principle used in
Exercise 127 we obtain

N_1 ≤ x_{k_1} P a(x_{k_1}) S(i_{k_1}, l).

Similarly, we get

N_2 ≤ x_{k_2} P a(x_{k_2}) S(i_{k_2}, l).

Thus the required number is

N = N_1 + N_2
  ≤ x_{k_1} P a(x_{k_1}) S(i_{k_1}, l) + x_{k_2} P a(x_{k_2}) S(i_{k_2}, l)
  = \sum_{j=k_1,k_2} x_j P a(x_j) S(i_j, l)
  = \sum_{\Delta=u−i_{k_1},\, u−i_{k_2}} P a(\Delta + 1)(\Delta + 1) S(u − \Delta, l).
Back to computation. This sets the stage for computing the number of
arrangements where the size of the largest nontrivial interval takes all possible
values. In other words,
∆ = 1, 2, . . . , (u − l).
Thus, in the general case, we get

S(u, l) ≤ \sum_{\Delta=1}^{u−l} 2(\Delta + 1) P a(\Delta + 1) S(u − \Delta, l).    (11.11)
Thus, to summarize,

P a(k) ≤ (k − 4) P a(k − 1) + \sum_{l=2}^{k−2} S(k − 1, l)(l − 1).    (11.12)
1. P a(2) = 2, P a(3) = 0, P a(4) = 2, and for larger k,

   P a(k) ≤ (k − 4) P a(k − 1) + \sum_{l=2}^{k−2} S(k − 1, l)(l − 1).

2. For k ≥ l > 1,

   S(l, l) = P a(l),
   S(u, l) ≤ \sum_{\Delta=1}^{u−l} (\Delta + 1) P a(\Delta + 1) S(u − \Delta, l).

3. For k > 1,

   N st′ (k) ≤ \sum_{l=2}^{k} S(k, l)(l − 1).
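The recurrences translate directly into a dynamic program; below is a minimal
Python sketch of the tables of Figure 11.1 (all names ours). Since the
recurrences are inequalities, the values it computes are upper bounds, not
exact counts.

```python
def arrangement_bounds(kmax):
    """Upper bounds on Pa(k), S(u, l) and Nst'(k) from the recurrences."""
    Pa = {2: 2, 3: 0, 4: 2}                 # base cases
    S, Nst1 = {}, {}
    for k in range(2, kmax + 1):
        if k > 4:                            # Pa(k) bound
            Pa[k] = (k - 4) * Pa[k - 1] + sum(
                S[(k - 1, l)] * (l - 1) for l in range(2, k - 1))
        for l in range(2, k + 1):
            if l == k:
                S[(k, l)] = Pa[k]            # S(l, l) = Pa(l)
            else:                             # S(u, l) bound
                S[(k, l)] = sum((d + 1) * Pa[d + 1] * S[(k - d, l)]
                                for d in range(1, k - l + 1))
        Nst1[k] = sum(S[(k, l)] * (l - 1) for l in range(2, k + 1))
    return Pa, S, Nst1

Pa, S, Nst1 = arrangement_bounds(8)
print(Pa[2], Pa[3], Pa[4])  # 2 0 2, the base cases above
```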
Figure 11.1 shows the order in which the functions can be evaluated. For
convenience, it has been broken down into four phases as shown. To avoid
clutter, the functions P a(·), S(·, ·) and N st′ (·) also refer to the one, two and
one dimensional arrays respectively that store the values of the functions as
shown in the figure.
3 The ‘programming’ refers to the particular order in which the tables are filled up and does
not refer to ‘computer programming’.
FIGURE 11.1: The three arrays that store the values of P a(·), S(·, ·) and
N st′ (·). The order in which the different functions are evaluated and stored
in the arrays in a dynamic programming approach is shown above in the first
four phases: I, II, III and IV. A check (✓) entry indicates that the function
has been evaluated at that point. See text for details.
What is the size of F r(T )? The burning question of this section is: Given
a PQ tree T with k leaf nodes, what is the size of F r(T )?
In other words, what is the number of arrangements that encode exactly
the same subsets of Σ as T ?
We define #(A), for each node A of T as follows. Let node A in the PQ
tree T have c children A1 , A2 , . . . , Ac . Then
#(A) = \begin{cases} 1 & \text{if } A \text{ is a leaf node,} \\ 2 \prod_{j=1}^{c} \#(A_j) & \text{if } A \text{ is a Q node,} \\ P a(c) \prod_{j=1}^{c} \#(A_j) & \text{if } A \text{ is a P node.} \end{cases}    (11.16)
FIGURE 11.2: Computation of #(X) for each node X in the PQ tree on
leaves 1–9 using Equation (11.16): #(A) = 2(1·1·1·1) = 2, #(B) = 2(1·1) = 2,
#(C) = 2(2·2·1·1) = 8 and #(D) = 2(8·1) = 16. The internal nodes are labeled
A, . . . , D.
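A direct transcription of Equation (11.16) in Python follows (the node
encoding and names are ours; the node kinds, with A a P node and B, C, D Q
nodes, are read off the arrangement lists of Figure 11.2 and should be
treated as an assumption):

```python
PA = {2: 2, 3: 0, 4: 2}   # Pa(c) for the node degrees used here

def count_arrangements(node):
    """#(A) of Equation (11.16) for a PQ tree node given as ('leaf',),
    ('Q', children) or ('P', children)."""
    if node[0] == 'leaf':
        return 1
    children = node[1]
    prod = 1
    for ch in children:
        prod *= count_arrangements(ch)
    return (2 if node[0] == 'Q' else PA[len(children)]) * prod

leaf = ('leaf',)
A = ('P', [leaf] * 4)           # Pa(4) = 2
B = ('Q', [leaf] * 2)
C = ('Q', [A, B, leaf, leaf])   # children A, B, 7, 8
D = ('Q', [C, leaf])            # children C, 9
print(count_arrangements(D))    # 16, matching #(D) in Figure 11.2
```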
(1) The input PQ tree T over a b c d e f g. (2) The leaf nodes are numbered
1–7 and the internal nodes labeled: node A admits the arrangements 3142 and
2413, node B admits 567 and 765, and node C admits AB and BA. (3)–(5) The 10
possible arrangements, in the numeric and the input alphabet:

1234567  abcdefg      7654321  gfedcba
3142567  cadbefg      7652413  gfebdac
3142765  cadbgfe      5672413  efgbdac
2413567  bdacefg      7653142  gfecadb
2413765  bdacgfe      5673142  efgcadb

FIGURE 11.3: Different steps involved in computing |F r(T )| are shown
in (1), (2) and (3). The different arrangements are shown in (4) and (5)
above.
FIGURE 11.4: The P node A has 20 children and the Q node B has 11
children; #(X) for the root node X = C is z, where 8(16)! < z < 4(20)!.
The lower and upper estimates have been used for each node.
2n .
q ∈ F r(T ).
We are now ready to pose our original question along the lines of the earlier
one: What is the probability, pr(T ), of a PQ tree T which has n leaf nodes,
labeled by integers 1, 2, . . . , n, being compatible with a random permutation of
1, 2, . . . , n?
For this we compute NT which is the number of permutations that are
compatible with T . Note that
NT = |F r(T )|.
pr(T ) = \frac{N_T}{n!} = \frac{|F r(T )|}{n!}.    (11.18)
An alternative view. Let T be a tree with n leaf nodes. We label the leaf
nodes by integers
1, 2, . . . , n
in the left to right order.5 Let q be a random permutation of integers
1, 2, . . . , n.
See Section 5.2.3 for a definition of random permutation. Then the probability,
pr(T ), of the occurrence of the event
q ∈ F r(T )
is given by

pr(T ) = \frac{|F r(T )|}{n!}.    (11.19)
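For instance, for the tree of Figure 11.2 (nine leaves, so n = 9, and
#(D) = |F r(T )| = 16), this gives

pr(T ) = \frac{16}{9!} = \frac{16}{362880} \approx 4.4 \times 10^{-5}.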
11.4 Exercises
For the problems below, see Section 5.2.4 for a definition of random
strings and Section 5.2.3 for a definition of random permutations.
5 In fact, the leaves could be labeled in any order (as long as it is a bijective mapping) and
|Σ| = n.
pk = pn−k ,
Exercise 123 (P -arrangement) For k > 1, let P a(k) denote the number
of P -arrangements of size k. Then what is S(k), the number of permutations
(arrangements) of size k that have exactly one nontrivial interval of size
l < k?
Hint: An interval of size l can be treated as a single number, that can then
be expanded to admit its own P -arrangement. Then does the following hold?
1. Row (1) shows the P -arrangements of 1, 2 and row(2) shows its inversion.
21 (1)
21 21 (1a)
(inversion)
12 (2)
12 12 (2a)
2. Row (1) shows the P -arrangements of 1..4 and row(2) shows its inver-
sion.
3142 (1)
3142 3142 3142 3142 (1a)
(inversion)
2413 (2)
2413 2413 2413 2413 (2a)
3. When l = 4, then the only possible nested intervals are size 4 and 5.
When l = 3, the number is zero since P a(3) = 0. When l = 2, let
r be the number of intervals, then for each case the possible (nested)
interval sizes are given in the following table. For example when r = 3,
2 < 3 < 5 is a possible configuration of the interval sizes and in the
2<4<5
2←3←2
21
23 1
231 231
32 41 2 43 1
3241 3241 2431 2431
34 251 4 23 51 2 45 31 25 34 1
34 2 5 1 4 23 5 1 2 45 3 1 2 5 34 1
21
3 12
312 312
4 21 3 41 32
4213 4213 4132 4132
5 32 14 53 12 4 51 34 2 514 23
5 32 1 4 5 3 12 4 5 1 34 2 5 1 4 23
where
x1 = i1 ,
xj = ij − ij−1 + 1, for r ≥ j > 1.
Hint: 1. Use the ideas of Exercise 126. 2. Consider the scenario when xj
and xj+1 are both of size 2. See also Exercise 126(3) for an illustration.
Principle 1 Principle 2
2413 3 12 4
inversions
24153 31524 35142 42513
1. Then the elements 1 and k do not occur together in the smallest interval.

N st′ (k) ≤ \sum_{l=2}^{k−2} S(k − 1, l)(l − 1).
2. (Case 2): If element k − 1 does occur, 1 does not occur and using state-
ment (2) of Exercise 129, element 0 can be inserted in any of the l − 1
positions. The arrangement of elements 0, 1, . . . , k − 1 is simply renum-
bered to elements 1, 2, . . . , k − 1, k.
Thus

N st′ (k) = \sum_{l=2}^{k−2} S(k − 1, l)(l − 1).
Hint: In Case 2, is it possible that this arrangement has already been ac-
counted for? Are all the nested interval sequences accounted for?
Hint: How is the counting done if two successive telescopic sizes of the inter-
vals are 2 each? In the arrangements that N st(k) counts, how many are such
that element k occurs in the smallest interval? How are the intervals that are
not strictly nested taken care of?
Exercise 133 Enumerate the frontiers of the PQ tree T shown below. What
is |F r(T )|?
(The PQ tree T : the root has children C and 9; C has children A, B, 7 and 8;
A has the leaves 1–4 and B has the leaves 5, 6.)
Hint: Notice that the leaf nodes are already labeled in the left to right order.
Each internal node and the possible arrangements of the children are shown
below.
A: 3142, 2413;  B: 56, 65;  C: 7A8B, B8A7;  D: C9, 9C.
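One way to check the answer mechanically is to enumerate F r(T ) outright.
The sketch below is our Python, with the node kinds assumed as in the hint
(A a P node, the rest Q nodes): a Q node contributes its child order and the
reversal, and a P node contributes only orders with no nontrivial interval,
i.e., P -arrangements.

```python
from itertools import permutations, product

def is_p_arrangement(p):
    """No contiguous block of p maps onto consecutive values, except the
    singletons and p itself."""
    c = len(p)
    for i in range(c):
        for j in range(i + 1, c):
            if (i, j) != (0, c - 1) and max(p[i:j + 1]) - min(p[i:j + 1]) == j - i:
                return False
    return True

def frontiers(node):
    """Enumerate Fr(T) for ('leaf', label), ('Q', children), ('P', children)."""
    if node[0] == 'leaf':
        return [(node[1],)]
    children = node[1]
    c = len(children)
    if node[0] == 'Q':
        orders = [tuple(range(c)), tuple(range(c - 1, -1, -1))]
    else:
        orders = [p for p in permutations(range(c)) if is_p_arrangement(p)]
    out = []
    for order in orders:
        for combo in product(*(frontiers(children[k]) for k in order)):
            out.append(tuple(x for part in combo for x in part))
    return out

A = ('P', [('leaf', i) for i in (1, 2, 3, 4)])
B = ('Q', [('leaf', 5), ('leaf', 6)])
C = ('Q', [A, B, ('leaf', 7), ('leaf', 8)])
D = ('Q', [C, ('leaf', 9)])
print(len(frontiers(D)))  # 16
```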
Exercise 134 Let T be a tree with n leaf nodes which are labeled by integers
1, 2, . . . , n in the left to right order and let q be a random permutation of
integers 1, 2, . . . , n.
1. If T has exactly one Q node and no P nodes, then what is the probability
of the following event:
q ∈ F r(T )?
2. If T has exactly one P node and no Q nodes, then what is the probability
of the following event:
q ∈ F r(T )?
A BC D E K L
F I J
G H
Exercise 136 Argue that Equation (11.20) is the probability of K > 1 occur-
rences of a structured permutation pattern (PQ tree) T .
Hint: How is the probability space defined for K occurrences?
Chapter 12
Topological Motifs
12.1 Introduction
Due to some unknown reason,1 nature has organized the blueprint of a
living organism along a line. Thus nucleotides on a strand of DNA or amino
acids on a protein sequence (the primary structure) or genes on a chromo-
some are linearly arranged and the study of strings has been an important
component in the general area of bioinformatics.
But sometimes, there is a deviation from this clean organizational simplicity,
for instance, a cell’s metabolic network, as we understand it. A metabolic
pathway is a series of chemical reactions that occur within a cell, usually
catalyzed by enzymes, resulting in the synthesis of a metabolic product that
is stored in the cell. Sometimes, instead of creating such a product, the
pathway may simply initiate another series of reactions (yet another pathway).
Various such metabolic pathways within a cell have a large number of common
components and thus form the cell’s metabolic network. Figure 12.1 shows
an example.
To study and gain an understanding in a domain such as this, one abstracts
the organization of this information as a graph. A graph captures this kind of
complex interrelationships of the entities. Continuing the theme of this book,
we seek the recurring structures in this data.
1 However, speculations abound and there are almost as many theories as there are scientists.
The size of the input is N_G = |V | + |E|. However, for convenience we denote
a graph as G(V, E) with vertex set V and edge set E. The vertex and edge
attribute mappings A_V , A_E are assumed to be implicit.
Figure 12.3 gives the exhaustive list of maximal and connected structures
(subgraphs or motifs) that occur at least two times in the input. Here
connected is defined as follows: for any pair of vertices v and v ′ there is
a path from v to v ′ , obtained by ignoring the directions of the edges.
No two adjacent vertices have the same attribute and no two vertices
adjacent to one vertex have the same attribute.
Consider the graph shown in Figure 12.4. Again, the color of the vertex de-
notes its associated attribute and the graph has seven connected components
numbered, for convenience, from (1) to (7).
The exhaustive list of maximal common structures (motifs) on this input
graph is 63 and is shown in Figure 12.5.
Two graphs G1 (V1 , E1 ) and G2 (V2 , E2 ) are isomorphic if there exists a
bijection

f : V1 → V2 ,

such that if (v1 v2 ) ∈ E1 then (f (v1 )f (v2 )) ∈ E2 .
A graph G(V, E) is self-isomorphic if f is a bijection as above and there
exists some v such that v ≠ f (v).
We abuse notation here and say that, given l > 0, a graph G(V, E) displays
self-isomorphism if for some nonempty sets

V1 ≠ V2 ⊂ V

there is a bijection

f : V1 → V2 ,

such that if (v1 v2 ) ∈ E then (f (v1 )f (v2 )) ∈ E.
Self-isomorphism is an important property of a graph, since it can be partic-
ularly confounding to methods that attempt to recognize common structures
in the graph.
Two occurrences are counted as distinct if there is at least one edge that
is not present in both the occurrences. In the figure, we count only distinct
occurrences. Thus an occurrence list 5(2), 6, 7(3) is to be interpreted as two
distinct occurrences in component (5), one occurrence in component (6) and
three distinct occurrences in component (7).
In this scenario, when is a motif maximal? Usually, a subgraph M ′ of a motif
M is considered nonmaximal with respect to (w.r.t.) M . We use a definition
that takes the occurrences into account as well; thus the occurrences may
actually determine whether M ′ is maximal or not:
1. If the number of distinct occurrences of M and M ′ differ, then M ′ is
maximal w.r.t. M .
2. However, if the number of distinct occurrences are the same, then M ′ is
nonmaximal w.r.t. M .
FIGURE 12.2: The input graph with 7 connected components where the
edges may be directed and have different attributes. The two attributes are
shown as solid and dashed edges.
M (VM , EM ),
|Vm |  |Em |  Nx,y      occurrence lists
 4      5    N4,5 =0
 4      4    N4,4 =1    1, 3, 7
 4      3    N4,3 =2    1, 3, 4, 7;  1, 2, 3, 7
 3      3    N3,3 =1    1, 3, 6, 7
 3      2    N3,2 =5    1, 3, 5, 6, 7;  1, 3, 4, 5, 7;  1, 2, 3, 6, 7;  2, 3, 4, 5, 7;  1, 2, 4, 5, 7
 2      1    N2,1 =4    1, 2, 3, 4, 5, 7;  1, 2, 3, 5, 6, 7;  1, 2, 3, 5, 6, 7;  1, 2, 3, 4, 6, 7
 1      0    N1,0 =4    1, 2, 3, 4, 5, 6, 7 (each of the four motifs)
FIGURE 12.3: Maximal (connected) motifs that occur at least two times
in the input graph of Figure 12.2. Nx,y denotes the number of motifs with x
vertices and y edges and each row shows motifs with x number of vertices and y
number of edges. For example Row 2 shows the maximal motifs with 5 vertices
and 4 edges each. The list of numbers below the motif gives the occurrence list,
for example 1, 3, 7 indicates that the motif occurs in components numbered
(1), (3) and (7).
where

VM = {u1 , u2 , . . . , up }, p ≥ 1,

and is said to occur on Oi if there exists a mapping

Fi : VM → O i ,

such that,
1. for each u ∈ VM ,
att(u) = att(Fi (u)), and
Let the number of such distinct mappings (Fi ’s) be K ′ . Then the number of
occurrences is K ′ and the occurrence lists are
O 1 , O 2 , . . . , OK ′ .
|Vm |  |Em |  Nx,y       occurrence lists (some shown)
 4      5    N4,5 =6     1, 7;  2, 7;  3, 7;  4, 7;  5, 7;  6, 7
 4      4    N4,4 =15    1, 3, 7;  1, 4, 7;  3, 4, 7;  4, 5, 7
 4      3    N4,3 =16    1, 3, 5, 7;  1, 3, 4, 7;  2, 3, 4, 7;  2, 4, 5, 7
 3      3    N3,3 =4     2, 4, 6, 7;  1, 3, 6, 7;  1, 4, 5, 7;  2, 3, 5, 7
 3      2    N3,2 =12    1, 3, 4, 5, 7;  1, 2, 3, 6, 7;  2, 3, 4, 5, 7;  1, 2, 4, 5, 7
 2      1    N2,1 =6     1, 2, 3, 4, 5, 7;  1, 2, 3, 5, 6, 7;  2, 3, 4, 5, 6, 7;  1, 2, 3, 4, 6, 7
 1      0    N1,0 =4     1, 2, 3, 4, 5, 6, 7 (each of the four motifs)
FIGURE 12.5: Maximal (connected) motifs that occur at least two times
in the input graph of Figure 12.4. In this example each occurrence of the
motif is in a distinct component of the graph. Nx,y denotes the number of
motifs with x vertices and y edges. Each row shows motifs with x number
of vertices and y number of edges. For example Row 1 shows some maximal
motifs with 4 vertices and 5 edges each.
FIGURE 12.6: Continuing the example of Figure 12.4. Any two edges may
not be adjacent, leading to disconnected structures; hence the three
structures are not motifs.
Input
  vertices |V | = 28, edges |E| = 36, input size |V | + |E| = 64.

Output
  |Vm |  |Em |  number l  occurrences K ′  output size lK ′ (|Vm | + |Em |)
   4      5       6           2               108
   4      4      15           3               360
   4      3      16           4               448
   3      3       4           4                96
   3      2      12           5               300
   2      1       6           6               108
   1      0       4           7                28
  Total          63                          1448
FIGURE 12.9: The input graph with only two (vertex) attributes.
FIGURE 12.10: Consider the graph of Figure 12.9. The top 3 components
are isomorphic to each other and so are the bottom three components.
|Vm |  |Em |  Nx,y      occurrence lists
 4      5    N4,5 =2    5, 7(3);  6, 7(3)
 4      4    N4,4 =2    5(2), 6(2), 7(3);  5, 6, 7(3)
 4      3    N4,3 =3    5(2), 6(2), 7(6);  5(4), 6(2), 7(6);  5(2), 6, 7(3)
 3      3    N3,3 =2    5, 7;  5, 6(2), 7(3)
 3      2    N3,2 =2    5(3), 6, 7(3);  5(2), 6(3), 7(6)
 2      1    N2,1 =2    5(2), 6(3), 7(3);  5(3), 6(2), 7(3)
 1      0    N1,0 =2    5(3), 6(3), 7(3);  5, 6, 7
FIGURE 12.11: Maximal (connected) motifs that occur at least two times
in the input graph of Figure 12.9. Since the components (1), (4) and (5)
are isomorphic (or identical) and so are components (2), (3) and (6), the
occurrences are listed only for components (5), (6) and (7) for each motif.
LU = {Fi (U ) | 1 ≤ i ≤ K ′ }.
If U is a singleton set,
U = {uj },
then its location list may also be written as
Luj .
LU = {F1 (U ), F2 (U ), . . . , FK ′ (U )}.
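In code, a location list falls straight out of the mappings. A minimal sketch
(names ours), with each F_i a dict from motif vertices to input vertices; the
second mapping below is our guess at the second component's occurrence in
Figure 12.12, consistent with Lu1 = {v2, v7}:

```python
def location_list(U, mappings):
    """L_U = { F_i(U) : 1 <= i <= K' } for a set U of motif vertices."""
    return {frozenset(F[u] for u in U) for F in mappings}

F1 = {'u1': 'v2', 'u2': 'v3', 'u3': 'v1'}
F2 = {'u1': 'v7', 'u2': 'v8', 'u3': 'v6'}   # assumed second occurrence
print(location_list({'u1'}, [F1, F2]))       # {frozenset({'v2'}), frozenset({'v7'})}
```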
12.3.1 Maximality
Consider the input graph with two connected components shown in Fig-
ure 12.12(a). Let the quorum be 2. A motif
M (VM , EM )
with
VM = {u1 , u2 , u3 }
is shown in Figure 12.12(b). The two occurrences of the motif with
1. att(u1 ) = blue,
2. att(u2 ) = green,
3. att(u3 ) = red,
are given as follows:
1. O1 = {v2 , v3 , v1 }, with
F1 (u1 ) = v2 , F1 (u2 ) = v3 , F1 (u3 ) = v1 and
FIGURE 12.12: (a) The different attributes of the nodes (vertices) are
shown in different colors. v1 to v5 form one connected component and the
other is formed by v6 to v11 . (b) and (c) show two motifs that occur once in
each connected component of input graph. (d) The maximal version of these
motifs, i.e., no more vertices or edges can be added to this motif.
1. Lu1 = {v2 , v7 },
O 1 , O 2 , . . . , Ol ,
Fi : VM → O i .
1. vi ∉ Oi , vi′ ∈ Oi ,
2. att(vi ) = a, for some attribute a, and,
3. (vi vi′ ) ∈ E.
12.4.1 Occurrence-isomorphisms
We next consider a slightly modified input graph shown in Figure 12.13(a).
Here the vertex attributes black and white have both been replaced by red.
How does the problem scenario change?
FIGURE 12.13: (a) The input graph with two connected components. (b)
and (c) show motifs that occur at least twice on the graph. (d) A structure
that is not a motif in (a).
att(u3 ) = green.
1. O1 = {v1 , v2 , v3 , v4 , v5 }, with
F1 (u1 ) = v1 , F1 (u2 ) = v2 , F1 (u3 ) = v3 , F1 (u4 ) = v4 , F1 (u5 ) = v5 .
2. O2 = {v1 , v2 , v3 , v4 , v5 }, with
F2 (u1 ) = v1 , F2 (u2 ) = v2 , F2 (u3 ) = v3 , F2 (u4 ) = v5 , F2 (u5 ) = v4 .
When attributes of two or more vertices of the motif are identical, sometimes
they can be mapped to a fixed set of vertices of the input graph in combi-
natorially all possible ways. For example u4 and u5 of the motif are mapped
onto the pair v4 and v5 in two possible ways (given by, F2 and F3 ). Similarly,
u4 and u5 are mapped to any two of v9 , v10 and v11 in six possible ways.
We term this explosion in the number of distinct mappings as combinatorial
explosion due to occurrence-isomorphism.
att(u1 ) = blue,
1. O1 = {v2 , v3 , v4 , v5 , v1 }, with
F1 (u1 ) = v2 , F1 (u2 ) = v3 , F1 (u3 ) = v4 , F1 (u4 ) = v5 , F1 (u5 ) = v1 .
2. O2 = {v2 , v3 , v4 , v5 , v1 }, with
F2 (u1 ) = v2 , F2 (u2 ) = v3 , F2 (u3 ) = v4 , F2 (u4 ) = v1 , F2 (u5 ) = v5 .
3. O3 = {v2 , v3 , v4 , v5 , v1 }, with
F3 (u1 ) = v2 , F3 (u2 ) = v3 , F3 (u3 ) = v1 , F3 (u4 ) = v4 , F3 (u5 ) = v5 .
4. O4 = {v2 , v3 , v4 , v5 , v1 }, with
F4 (u1 ) = v2 , F4 (u2 ) = v3 , F4 (u3 ) = v1 , F4 (u4 ) = v5 , F4 (u5 ) = v4 .
5. O5 = {v2 , v3 , v4 , v5 , v1 }, with
F5 (u1 ) = v2 , F5 (u2 ) = v3 , F5 (u3 ) = v5 , F5 (u4 ) = v1 , F5 (u5 ) = v4 .
6. O6 = {v2 , v3 , v4 , v5 , v1 }, with
F6 (u1 ) = v2 , F6 (u2 ) = v3 , F6 (u3 ) = v5 , F6 (u4 ) = v4 , F6 (u5 ) = v1 .
3 Although usually a graph is written as G(V, E), here we use a motif notation M (VM , EM )
for the graph to emphasize the fact that indistinguishability of vertices is primarily associ-
ated with motifs (but evidenced in the input graph).
U = {u4 , u5 }
and one recovers LU from LcU by taking all two-element subsets (two being the
smallest cardinality of the sets in LU ) of the sets in LcU . We call
M (VM , EM ),
if
then U1 and U2 are called compact vertices. Further, each of the following
holds.
2. U1 ∩ U2 = ∅.
This naturally leads to the compact notation for the motif where the vertex
set is a collection of compact vertices and compact edges defined on them. For
convenience, we represent a compact motif as C with the following notation
C(VC , EC ) ≡ M (VM , EM ),
where VC is the set of compact vertices and EC the set of compact edges.
It is important to note that two compact vertices may have a nonempty
intersection. The second example shown in Figure 12.14 illustrates such a
nonempty intersection.
Vx = {v ∈ V | att(v) = x}.
(a) M (VM , EM ) with VM = {u1 , u2 , u3 , u4 , u5 } and
EM = {u1 u2 , u2 u3 , u2 u4 , u2 u5 }; in compact form, C(VC , EC ) with
VC = {U1 , U2 , U3 } and EC = {U1 U2 , U2 U3 }.

(b) M (VM , EM ) with VM = {u1 , u2 , u3 , u4 , u5 , u6 } and
EM = {u1 u2 , u2 u3 , u3 u4 , u3 u5 , u3 u6 }; in compact form, C(VC , EC )
with VC = {U1 , U2 , U3 , U4 } and EC = {U1 U2 , U2 U3 , U3 U4 }.

FIGURE 12.14: Two examples of motifs with their usual notation and the
corresponding compact notation. Notice that in the motif in (b), two compact
vertices U1 and U4 have a nonempty intersection, i.e., U1 ∩ U4 = {u1 }.
THEOREM 12.1
(Conjugate maximality theorem) The conjugate of a maximal location
list is maximal.
2. f lat(L) = \bigcup_{L \in \mathcal{L}} L.
4. All the vertices in f lat(L) have the same attribute given by att(L).
5. For 1 ≤ l ≤ ℓ, let

E_{L_l} = {(v1 v2 ) ∈ E | v1 , v2 ∈ Ll }.

Then G_{L_l}(Ll , E_{L_l}) is called the induced subgraph on Ll . Further,
if (u1 u2 ) ∈ E_{L_l} for each u1 , u2 ∈ Ll , then the graph G_{L_l} is
called a clique.
clq(L) is an indicator that is set to 1 if all the ℓ induced subgraphs are
cliques. Formally,

clq(L) = \begin{cases} 1, & \text{if for each } L \in \mathcal{L},\ G_L(L, E_L) \text{ is a clique,} \\ 0, & \text{otherwise.} \end{cases}
4 For example, the compact list LU = {{v1 , v2 , v3 }, {v4 , v5 }} is written
as LU = {L1 , L2 }, where L1 = {v1 , v2 , v3 } and L2 = {v4 , v5 }.
U1 = {u4 , u5 }
U2 = {u3 }
and vice versa. The compact location lists of U1 and U2 , along with their
characteristics, are given below.
(b) LU2 = {{v3 }, {v8 }} or simply Lu3 = {v3 , v8 } with att(LU2 )= green.
U3 = {u3 , u4 , u5 }
U4 = {u2 }
(b) LU4 = {{v3 }, {v8 }} or simply Lu2 = {v3 , v8 } with att(LU4 )= green.
L3 = L1 ∩c L2 , generalized to L1 ∩c L2 ∩c · · · ∩c Lp ;

L3 = L1 ∪c L2 , generalized to L1 ∪c L2 ∪c · · · ∪c Lp ;

L3 = L1 \c L2 , where

L1 \c L2 = L1 \c (L1 ∩c L′1 ) = {L1 \ L2 | L1 ∈ L1 , L2 ∈ L2 and L1 ∩ L2 ≠ ∅}.
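A small sketch of these operations follows (our Python; compact lists as sets
of frozensets). The difference follows the displayed definition, except that
members of the first list meeting nothing in the second are kept whole, which
is what the worked example of Figure 12.17 shows; the intersection is assumed
to be the elementwise analogue:

```python
def inter_c(La, Lb):
    """Assumed elementwise intersection: pairwise nonempty intersections."""
    return {a & b for a in La for b in Lb if a & b}

def diff_c(La, Lb):
    """La \\c Lb: each member of La minus any member of Lb it meets;
    untouched members are kept whole (cf. Figure 12.17)."""
    out = set()
    for a in La:
        hit = False
        for b in Lb:
            if a & b:
                out.add(a - b)
                hit = True
        if not hit:
            out.add(a)
    return out

# The lists of Figure 12.17:
L1 = {frozenset({'v1'}), frozenset({'v6'})}
L4 = {frozenset({'v1', 'v4', 'v5'}), frozenset({'v9', 'v10', 'v11'})}
print(diff_c(L4, L1))   # {{v4,v5}, {v9,v10,v11}} = L4 \c L1
print(inter_c(L1, L4))  # {{v1}} = L1 ∩c L4
```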
d = discSz(L) and d_max = \max_{L_i \in L} |L_i |.
discSz(L) > d.
2. Let
Li (⊂ V ),
where all the vertices have the same attribute. If
U ⊂ Li
Clearly
clq(clique(L)) = 1.
Note that Enrich(L) results in possibly more than one new list whereas
clique(L) results in at most one new list.
We next make another observation that is important for the algorithm that
is discussed in the later sections. It states that conjugate lists can be computed
by simply taking appropriate subsets of other known compact lists rather than
using the original graph G(V, E) as in Equation (12.1).
LEMMA 12.1
(Intersection conjugate lemma) For lists
L1 , L2 , . . . , Lp
For
1 ≤ j ≤ p,
COROLLARY 12.1
(Subset conjugate lemma) For an attribute a, let La be a conjugate of L.
Define a relation R as follows:

(L, La ) ∈ R  ⇔  L is a conjugate of La ,

for L ∈ L and La ∈ La .
Let
L′ ⊂c L.
Then the conjugate, L′a , of each element L′ ∈ L′ is given as follows:

L′a = \bigcup_i Lai ,  where  L′ ⊂ Li ∈ L and (Li , Lai ) ∈ R.
Further, if L, La and L′ are maximal, then this conjugate, L′a , is the same
set given by
conja (L′ ) (of Equation (12.1))
that is obtained directly using the input graph G(V, E).
6 The two ‘unions’ here are due to the fact that in the intersection set L′ , more than one
7 To avoid confusion with attributes on vertices we call this the ‘edge type’ instead of ‘edge
attribute’.
FIGURE 12.15: (a) A meta-graph where the ‘forbidden’ edge is shown
in bold (connects an inconsistent pair), the ‘subsume’ edge is shown dashed
(connects a complete intersection pair) and the remaining edges are ‘link’
edges (denoting the conjugacy relation). The correspondence between the
node numbers shown here to the location lists shown in Figure 12.17 are as
follows: 1 ↔ L1 , 2 ↔ L2 , 3 ↔ L3 , 4 ↔ L4 , 5 ↔ L4 \c L1 . The singleton lists
are not shown here. (b) and (c) show the MCCSs and the maximal motifs
respectively.
(a) Input graph G1 . (b) Input graph G2 . (The graphs on the vertices
v1 –v11 and their adjacency matrices B1 and B2 are not reproduced here.)
FIGURE 12.16: The two running examples G1 and G2 with their adjacency
matrices. Each graph has two connected components.
Initialization (Linit ):
  L1 = {v1 , v6 } ⇔ L2 = {v2 , v7 } ⇔ L3 = {v3 , v8 } ⇔
  L4 = {{v1 , v4 , v5 }, {v9 , v10 , v11 }}
  (⇔ denotes the conjugacy relation); L1 and L4 are inconsistent.

Iterative step:
  L4 \c L1 = {{v4 , v5 }, {v9 , v10 , v11 }} ⇔ L3 = {v3 , v8 };
  L1 and L4 \c L1 are consistent.
  L1 ∩c L4 = {v1 } ⇔ L′2 = {v2 } ⇔ L′3 = {v3 } ⇔ L′4 = {{v1 , v4 , v5 }};
  L1 ∩c L4 and L′4 are consistent.
  L1 \c L4 = {v6 } ⇔ L′2 = {v7 } ⇔ L′3 = {v8 } ⇔ L′4 = {{v9 , v10 , v11 }};
  L1 \c L4 and L′4 are consistent.

FIGURE 12.17: The solution for the input graph shown in Figure 12.16(a).
See text for details.
Initialization (Linit ):
  L1 = {v1 , v6 } ⇔ L2 = {v2 , v7 } ⇔ L3 = {v3 , v8 } ⇔
  L4 = {{v1 , v4 , v5 }, {v9 , v10 , v11 }};
  L1 and L4 are inconsistent.

Iterative step:
  L4 \c L1 = {{v4 , v5 }, {v9 , v10 , v11 }} ⇔ L3 = {v3 , v8 };
  L1 and L4 \c L1 are consistent.
  clique(L4 ) = L6 = {{v4 , v5 }, {v9 , v10 }, {v10 , v11 }} ⇔ L3 = {v3 , v8 , v8 };
  L1 and L6 are consistent.

FIGURE 12.18: The solution for the input graph shown in Figure 12.16(b).
Note that singleton location lists are not shown. See text for details.
L1 \c L2 and L2 \c L1 ,
which excludes these common (offending) vertices. It is easy to see that for
an inconsistent pair L1 and L2 ,
1. the two must lie on a cycle (of link edges) in the meta-graph G(L, E)
and
2. att(L1 ) = att(L2 ).
8 Most patternists (scientists who specialize in pattern discovery in data) with an honest
regard for mathematics, shudder at the thought of patterns in graphs because of the sheer
promiscuity that the vertices display in terms of how many neighbors (partners) each can
sustain. We also get no respite from the consequences of this. So, it is not surprising that
both graph-theoretic and set-theoretic tools are used for this (that of connectedness in the
meta-graph and complete intersection of compact lists).
Before we get down to the task of computing the compact motifs, we sum-
marize the important observations as a theorem.
THEOREM 12.2
(Maximal begets maximal theorem) If L1 and L2 are maximal, then the
following statements hold.
3. clique(L1 ) is maximal.
4. L1 ∩c L2 is maximal.
THEOREM 12.3
(Compact motif theorem) Given an input graph, let its meta-graph be
given as
G(L, E).
A subgraph
C(VC , EC )
1. (connected) For any two vertices in VC there is a path between the two
where each edge is of type ‘link’.
(L1 L2 ) ∈ EC
Next,
C(VC , EC )
defines a (maximal) compact motif on input graph G(V, E).
C(VC , EC ).
We wish to compute
M (VM , EM )
from C(VC , EC ). See Figure 12.14 for the notation and concrete examples.
For a location list Li ∈ VC , let
att(Li ) = ai and
discSz(Li ) = di .
Also if
clq(Li ) = 1,
then an edge is introduced in the motif between every pair of vertices, i.e.,
for each 1 ≤ j < k ≤ di ,

(u_{i_j} u_{i_k}) ∈ EM .
However, if the edge type of (Li Lj ) is ‘subsume’, i.e., one is a complete subset
of the other, then without loss of generality let
di ≤ dj ,
What’s next?
Given an input graph G(V, E) and a quorum K, we have stated the need
to discover all the maximal topological motifs (and their occurrences in the
graph) that satisfy this quorum constraint. We have then painstakingly con-
vinced the reader that we want these maximal motifs to be in the compact
form. The occurrences of these compact motifs on the input graph is described
by compact lists.9
Now that we know ‘what’ we want, we must next address the question of
‘how’. Thus the next natural step is to explore methods to compute these
compact motifs and lists from the input.
9 Note however that the nature of the beast is such that even compact lists may not save
the day: see Exercise 154 based on the example of Figure 12.4.
att(Lai ) = ai ,  att(Laj ) = aj ,  Conjai (Laj ) = Lai ,  Conjaj (Lai ) = Laj .
Ensuring the maximality of these lists is utterly simple since it can almost
be read off the incidence matrix B. This is best understood by following a
simple concrete example. We show two such examples in Figure 12.16. But
we must also avoid overcounting, as discussed in the following paragraphs.
If f lat(L1 ) = f lat(L2 ),
then discSz(L1 ) must be different from discSz(L2 ).
G({v1 , v2 }, {(v1 v2 )}).
We first compute the maximal subset L′5 = Enrich(L5 ) and compute its
conjugate directly as a subset of L5 . For each pair of attributes, the
maximal lists are read off the adjacency matrix B1 as follows:
        red       blue      green
red     X         L1 ⇔ L2   L4 ⇔ L3
blue    -         X         L2 ⇔ L3
green   -         -         X
Thus
Linit = {L1 , L2 , L3 , L4 }
and these lists, along with the conjugacy relations, are shown in Figure 12.17.
Next, consider the example G2 in Figure 12.16(b). Again, for each pair of
attributes, the maximal lists are simply ‘read off’ the adjacency matrix B2 as
follows:
        red       blue      green
red     L5 ⇔ L5   L1 ⇔ L2   L4 ⇔ L3
blue    -         X         L2 ⇔ L3
green   -         -         X
and these lists, along with the conjugacy relations, are shown in Figure 12.18.
S = C_{i_1} ∩c C_{i_2} ∩c · · · ∩c C_{i_p} ,

we denote by

I_S = {i_1 , i_2 , . . . , i_p }.

Further, I_S is maximal, i.e., there is no I ′ with I_S ⊊ I ′ such that
Putting it all together. These steps are carefully integrated into one co-
herent procedure outlined as Algorithm (12).
Here InitMetaGraph(L) is a procedure that generates the meta-graph given
the collection of lists, L, and their conjugates. See Section 12.4.11 for details
on detecting inconsistencies in a connected component of the meta-graph.
Induct(L, p, L1 , . . . , Lp ) is a procedure that introduces the new list L to
the collection. Further, L ⊂c L1 , . . . , Lp and the conjugates of these p lists
are used to compute the conjugates of L.
The remainder of the algorithm is self-explanatory. Figures 12.17 and 12.18
give the solution to the input graphs of Figure 12.16.
Consider Figure 12.17. See also Figure 12.15 for the meta-graph. In the
connected component (to avoid clutter, we show only the nodes)
(L1 , L2 , L3 , L4 , L4 \c L1 ),
L1 and L4 are inconsistent. Thus the two maximal subgraphs that are con-
sistent are
1. (L1 , L2 , L3 , L4 \c L1 ) and
2. (L2 , L3 , L4 , L4 \c L1 ).
Note that L4 \c L1 is a complete subset of L4 . Further,
(L1 , L2 , L3 )
(L1 , L2 , L3 , L4 , L4 \c L1 , L6 ),
//input G(V, E)
DiscoverLists(L) //output L
{
  Compute Linit
  L ← Linit
  InitMetaGraph(L)
  Lnew ← Linit
  WHILE (Lnew ≠ ∅)
    FOR EACH L ∈ Lnew {
      FOR EACH L′ ∈ Enrich(L)
        Induct(L′ , 1, L)
      L′ ← clique(L); Induct(L′ , 1, L)
    }
    L ← L ∪ Lnew
    Refine(L)
    FOR EACH (L, p, L1 , . . . , Lp ) computed in Refine
      Induct(L, p, L1 , . . . , Lp ) // L is L1 ∩c . . . ∩c Lp
  ENDWHILE
}

Induct(L, p, L1 , . . . , Lp )
{
  IF L ≠ ∅ THEN
    IF L ∉ L THEN
      Add L to Lnew
      Add vertex L to meta-graph
    FOR 1 ≤ i ≤ p
      IF Li and L are inconsistent THEN
        Induct(L \c Li , 1, L); Induct(Li \c L, 1, Li )
    FOR EACH conjugate (using the p lists) L′ of L
      IF L′ ∉ L THEN
        Add L′ to Lnew
        Add vertex L′ & edge (LL′ ) to meta-graph
}
Given a graph G(V, E) with labeled vertices and edges, the task is to
discover at least K subgraphs that are topologically identical in G. Such
subgraphs are termed topological motifs.
It is very closely related to the classical subgraph isomorphism problem
defined as follows [GJ79]:
All three problems are NP-complete: each can be transformed from the
problem of finding maximal cliques10 in a graph. The problem addressed in
this chapter is similar to the latter two problems. However, our interest has
been in finding at least K isomorphs, and all possible such isomorphs.
12.7 Applications
Understanding large volumes of data is a key problem in a large number of
areas in bioinformatics, and also in other areas such as the World Wide Web.
Some of the data in these areas cannot be represented as linear strings, which
have been studied extensively with a repertoire of sophisticated and efficient
algorithms. The inherent structure in these data sets is best represented as
graphs. This is particularly important in bioinformatics or chemistry since it
12.8 Conclusion
The potential of an effective automated topological motif discovery is enor-
mous. This chapter presents a systematic way to discover these motifs using
compact lists. Most of the proofs of the lemmas and theorems presented in
this chapter are straightforward. The only tool they use is ‘proof by contra-
diction’, hence they have been left as exercises for the reader.
One of the burning questions is to compute the statistical significance of
these motifs. Can compact motifs/lists provide an effective and acceptable
method to compute the significance of these complex motifs? We leave the
reader with this tantalizing thought.
12.9 Exercises
Exercise 137 (Combinatorics in graphs) Consider the graph of Figure 12.4
with quorum = 2. Notice that every maximal motif occurs in component la-
beled 7. Thus the number of distinct collections of components is

2^6 − 1,

ignoring the empty set. Since there are four vertices in each component with
distinct attributes, the number of location lists of vertices can be
estimated as

4(2^6 − 1) = 252.
However, this number is 212 as shown in Figure 12.8. How is the discrepancy
explained?
Hint: Notice that the motifs are (1) connected, (2) maximal and (3) satisfy
a given quorum. Thus the numbers are not captured by pure combinatorics,
although they are fairly close. The number of distinct motifs can be counted
by enumerating i edges, 1 ≤ i ≤ 6, out of 6 possible edges in a motif. Recall
that quorum is 2.
\binom{6}{5} = 6    (12.5)

\binom{6}{4} = 15    (12.6)

\binom{6}{3} - 4 = 16    (12.7)

\binom{6}{2} - 3 = 12    (12.8)

\binom{6}{1} = 6    (12.9)

4 \binom{6}{0} = 4    (12.10)
• (Equation 12.10): Since each connected component of the graph has four
distinct colored vertices, Equation (12.10) shows a count of 4 × 1 = 4.
2. Identify the bijections (f ’s) that demonstrate that components (1), (4)
and (5) are isomorphic in Figure 12.10.
Exercise 139 (Motif occurrence) Consider the input graph in (a) below
with 10 vertices and two distinct attributes. The attribute of a vertex is de-
noted by its shape in this figure: ‘square’ and ‘circle’. Let quorum K = 2. A
motif is shown in (b). In (1)-(10) the occurrences of the motif on the graph
are shown in bold.
(a) The input graph on v1 –v10 . (b) The motif on u1 –u4 . (1)–(10) The
occurrences, shown in bold in the original figure.
1. For each occurrence i: (a) What is Oi ? (b) Define the map Fi (of
Definition (12.1)).
3. Obtain the compact notation of the motif and the location lists of the
compact vertices.
Exercise 140 (Motif occurrence) Consider the input graph of Problem (139)
and let quorum K = 2. (a) below shows a motif with six vertices. The occur-
rences (in bold) of the motif are shown in (1)-(5). The dashed-bold indicates
multiple occurrences shown in the same figure: the occurrence of the motif is
to be interpreted as all the solid vertices and edges and one of the dashed edges
and the connected dashed/solid vertex.
(a) The motif on u1 –u6 . (1)–(5) The occurrences, shown in bold in the
original figure.
3. Obtain the compact notation of this motif and the location lists of the
compact vertices.
Hint: Note that the motif edge (u1 u5 ) must also be represented in the com-
pact motif notation giving at least two compact vertices U1 = {u5 } and
U2 = {u5 , u6 } although U1 ⊂ U2 .
(Figure: the input graph G; step 1 annotates G; step 2 generates V ′ .)
2. (Generate V ′ ): For each vertex with attribute Aj and incident edge with
attribute xi , create a node with attribute Aj xi in G′ .
(Figure: step 3 generates E ′ ; step 4 assigns the node labels in G′ ; step 5
shows the input graph G and a maximal motif in G′ and G.)
3. (Generate E ′ ): For each pair of nodes with labels .yk and .yk , introduce
an undirected edge. Similarly for each pair of nodes with labels Ai . and
Ai ., introduce an undirected edge.
4. (Node labels in G′ ): Two node labels of the form Aj1 yk1 and Aj2 yk2 are
deemed to have the same node attribute in G′ .
(Figure: the input graph G; step 1 annotates G; step 2 generates V ′ .)
(Figure: step 3 generates E ′ ; step 4 assigns the node labels in G′ ; step 5
shows the input graph G and a maximal motif in G′ and G.)
3. (Generate E ′ ): For each pair of nodes with labels ··yk and yk ··, introduce
an undirected edge.
4. (Node labels in G′ ): Two node labels of the form xi1 Aj1 yk1 and xi2 Aj2 yk2
are deemed to have the same node attribute in G′ .
Show that the same property holds even when motif M (VM , EM ) is not max-
imal.
Exercise 144 Using the definition of the intersection of two compact lists,
define the intersection of p compact lists.
Exercise 145 Using the definitions of the set operations on compact lists,
show that given compact lists L1 , L2 and L3 , if
L3 ⊂c L1 , L2 ,
then
L3 ⊂c (L1 ∩c L2 ).
1. L ∪c L =c L.
2. L ∩c L =c L.
3. L \c L =c ∅.
Exercise 147 Given two compact lists L1 and L2 , prove the following state-
ments.
(b) Can you relax the definition of intersection ∩c , to ∩new in such a manner
that an intersection is not necessarily maximal, i.e., for maximal sets
L1 and L2 , L1 ∩new L2 may not be maximal.
Exercise 150 Construct an example to show that the enrich operation (Sec-
tion 12.4.8) is essential to obtain all the maximal topological motifs.
Hint: Consider the following input graph with three connected components.
Exercise 151 Let L0 ∈ Linit with att(L0 ) = att(L) and discSz(L0 ) > 1.
Then, show that
clique(L) = L ∩c L0 .
Va = {v ∈ V | att(v) = a}.
Further, let the input graph be such that the induced subgraph on Va has k
cliques with the following sizes of the cliques:
1 < d1 ≤ d2 ≤ . . . ≤ dk .
Does there exist L0 ∈ Linit with discSz(L0 ) > 1? Why? If yes, then deter-
mine the following characteristics of L0 : f lat(L0 ), discSz(L0 ), att(L0 ) and
clq(L0 ).
Hint: See Figure 12.13. Is the structure in (d) a topological motif in the input
graph in (a)? Are the motifs in (b) and (c) maximal by the new definition?
Hint: Given a graph, an independent set is a subset of its vertices that are
pairwise not adjacent. In other words, the subgraph induced by these vertices
has no edges, only isolated vertices. Then, the independent set problem is as
follows: Given a graph G and an integer k, does G have an independent set
of size at least k ? The corresponding optimization problem is the maximum
independent set problem, which attempts to find the largest independent set
in a graph. The independent set problem is known to be NP-complete.
The connectedness of a graph, on the other hand, can be computed in linear
time using a BFS or a DFS traversal.
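For contrast with the NP-complete independent set problem, connectivity really
is a linear-time check; a standard BFS sketch in Python:

```python
from collections import deque

def is_connected(vertices, edges):
    """Linear-time connectivity test; edge directions are ignored."""
    if not vertices:
        return True
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    start = next(iter(vertices))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(vertices)

print(is_connected({1, 2, 3}, [(1, 2)]))  # False: vertex 3 is isolated
```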
(Figure: three incidence tables, for the attributes b, g and d respectively,
over components 1–7, showing the stepwise (⇒) refinement of the location
lists; not reproduced here.)
Exercise 155 Consider the following two examples where the vertices in the
graphs have only one attribute. Assume quorum K = 2. Compute all the
maximal motifs in their compact form.
[Figure: the two example graphs, on the vertices v1, …, v5 and u1, u2, u3.]
Exercise 156 ∗∗ Let G be a graph and let M be the collection of all topological
motifs on G that satisfy a quorum K. Design an algorithm to update M when
a new vertex v and all its incident edges are added to G.
Hint: This is also known as an incremental discovery algorithm.
Comments
The reader might find this the most challenging chapter in the book. But
he/she can seek solace in the fact that even experts have had difficulty with
this material. However, the ideas presented in this chapter are simple, though
perhaps not obvious.
It is amazing to what lengths scientists are willing to inconvenience themselves to gain only a sliver of understanding of the nature of biology: brute-force enumeration of topological motifs has been used, albeit for motifs with very few vertices.
A natural question might arise in a curious reader's mind: Why not use multiple BFS (breadth first search) traversals to detect the recurring motifs? In fact, a survey of the literature reveals that such approaches have been embraced. However, the problem of explosion due to occurrence-isomorphism will cripple such a system on data sets with rampant common attributes. Of course, it is quite possible to generate instances of input that would be debilitating even to the approach presented here.
Chapter 13
Set-Theoretic Algorithmic Tools
13.1 Introduction
Time and again, one comes across the need for an efficient solution to a
task that leaves one with an uncomfortable sense of déjà vu. To alleviate such
discomfort, to a certain extent, most compilers provide libraries of routines
that perform oft-used tasks. Most readers may be familiar with string or
input-output libraries. Taking this idea further, packages such as Maxima, a symbolic mathematics system, R, a language and environment for statistical computing, and other such tools provide invaluable support and means for solving difficult problems.
In the same spirit, what are the nontrivial tasks, requiring particular attention, in the broad area of pattern discovery that one encounters over and over again?
This chapter discusses sets, their interesting structures (such as orders,
partial orders) and efficient algorithms for simple, although nontrivial, tasks
(such as intersections, unions). This book treats lists as sets. Thus a location
list, which is usually a sorted list of integers (or tuples), is treated as a set for
practical purposes.
Let the alphabet be
Σ = {σ1, σ2, …, σL},
where
L = |Σ|.
Let S1 and S2 be two nonempty sets. Then only one of the following three holds:
1. S1 and S2 are disjoint if and only if
S1 ∩ S2 = ∅.
For example,
S1 = {a, b, c} and S2 = {d, e}
are disjoint, since {a, b, c} ∩ {d, e} = ∅.
2. S1 is contained in S2 if and only if
S1 ⊆ S2.
For example,
S1 = {a, b}
is contained in
S2 = {a, b, d, e},
since
{a, b} ⊂ {a, b, d, e}.
3. S1 and S2 straddle if and only if the two set differences are nonempty, i.e.,
S1 \ S2 ≠ ∅ and
S2 \ S1 ≠ ∅.
For example,
S1 = {a, b, c, e} and S2 = {a, b, d}
straddle since
{a, b, c, e} \ {a, b, d} = {c, e}, and
{a, b, d} \ {a, b, c, e} = {d}.
In other words, for some x, y ∈ Σ,
x ∈ S1 \ S2 and
y ∈ S2 \ S1 .
Given a collection of sets S, define a directed graph G(S, E) on the node set S where, for S1, S2 ∈ S,
S1 ⊂ S2 ⇔ (S2 → S1) ∈ E.
The graph G(S, E) is called S's partial order (graph). Let
(S2 → S1) ∈ E.
Then the following terminology is used.
1. S1 is called the child of S2.
2. S2 is called the parent of S1.
3. If two nodes S1 and S3 have a common parent S2, i.e.,
(S2 → S1), (S2 → S3) ∈ E,
then S1 and S3 are called siblings.
LEMMA 13.1
(The descendent lemma) If S1 is a descendent of S2, then
S1 ⊊ S2.
Next, consider the transitive reduction of G(S, E), defined by the edge set Er:
(S2 → S1) ∈ Er ⇔ (S2 → S1) ∈ E and there is no S′ ∈ S such that S′ is a descendant of S2 and S1 is a descendant of S′.
The transitive reduction of G(S, E) is written as
G(S, Er).
For convenience, in the rest of the chapter this is also called the reduced partial
order (graph) of S. Figure 13.1 gives a concrete example.
FIGURE 13.1: Here S = {{a, b, c, d}, {a, b, c}, {a, b}, {b, c}}. Note that
some edges are missing in (b), yet C(G(S, E)) = C(G(S, Er )).
Note that the reduced partial order encodes exactly the same subset information as the partial order but with possibly fewer edges, i.e.,
Er ⊆ E.
LEMMA 13.2
(Unique transitive reduction lemma) Er , the smallest set of edges sat-
isfying
C(G(S, E)) = C(G(S, Er )),
is unique.
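The following is a small sketch (not the book's code; all names are assumptions) of the two constructions for a collection of sets; on the collection of Figure 13.1 it drops exactly the redundant edges:

def partial_order_edges(S):
    # Edge i -> j whenever S[j] is a proper subset of S[i].
    return {(i, j) for i, A in enumerate(S) for j, B in enumerate(S)
            if i != j and B < A}

def transitive_reduction(S, E):
    # Keep i -> j only if no S[k] lies strictly between S[j] and S[i].
    return {(i, j) for (i, j) in E
            if not any((i, k) in E and (k, j) in E for k in range(len(S)))}

S = [frozenset('abcd'), frozenset('abc'), frozenset('ab'), frozenset('bc')]
E = partial_order_edges(S)        # includes {a,b,c,d} -> {a,b}
Er = transitive_reduction(S, E)   # drops it: {a,b,c} lies in between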
Gstraddle(VS, ES).
Thus the straddle graph is defined on a subset of S whose members straddle. Figure 13.2 shows a concrete example.
The reduced partial order graph, G(S, Er ) is shown in (a); (b) shows the edges
that connect nodes with common children (these edges are shown as dashed
edges), and (c) shows the two nonsingleton connected straddle graphs. The
singleton straddle graphs (graphs whose node set V has only one element)
are:
LEMMA 13.3
(Unique straddle graph lemma) If
S′ ∈ VS,
then
Gstraddle(VS, ES) = Gstraddle(VS′, ES′).
This can be verified and we leave that as Exercise 160 for the reader.
[Figure panels: (a) Reduced partial order. (b) Edges (dashed) connecting nodes that share a common child. (c) The two connected straddle graphs.]
FIGURE 13.2: A reduced partial order graph with dashed edges between
nodes that have common children.
The intersection closure of S is defined as
B∩(S) = { S1 ∩ S2 ∩ … ∩ Sl | Si ∈ S, 1 ≤ i ≤ l, for some l ≥ 1 }.
In other words, this is the collection of all possible intersections of the sets.
For example, let
S = { {a, b, c, d}, {a, b, c, e}, {b, c, f}, {e, f} }.
Then,
B∩(S) = { {e},          ({a, b, c, e} ∩ {e, f})
          {f},          ({b, c, f} ∩ {e, f})
          {b, c},       ({a, b, c, d} ∩ {a, b, c, e} ∩ {b, c, f}, or
                         {a, b, c, d} ∩ {b, c, f}, or
                         {a, b, c, e} ∩ {b, c, f})
          {a, b, c},    ({a, b, c, d} ∩ {a, b, c, e})
          {e, f},       (in S)
          {b, c, f},    (in S)
          {a, b, c, d}, (in S)
          {a, b, c, e} }. (in S)
Note that we define the union only over a collection of overlapping sets. For example, when
S = { {a, b, c, d}, {a, b, c, e}, {b, c, f}, {e, f}, {g, h} },
then,
B∪(S) = { {e, f},             (in S)
          {g, h},             (in S)
          {b, c, f},          (in S)
          {a, b, c, d},       (in S)
          {a, b, c, e},       (in S)
          {b, c, e, f},       ({b, c, f} ∪ {e, f})
          {a, b, c, d, e},    ({a, b, c, d} ∪ {a, b, c, e})
          {a, b, c, d, f},    ({a, b, c, d} ∪ {b, c, f})
          {a, b, c, e, f},    ({a, b, c, e} ∪ {b, c, f} ∪ {e, f})
          {a, b, c, d, e, f} }. ({a, b, c, d} ∪ {a, b, c, e} ∪ {b, c, f})
Can the following union be considered for membership in B∪(S):
{a, b, c, d} ∪ {g, h}?
The answer is no, since the two sets have no overlap. Note that the set {g, h} does not overlap with any of the other sets in S.
Then, how about the following:
{a, b, c, d} ∪ {e, f} = {a, b, c, d, e, f}? (13.2)
But
{a, b, c, d} ∩ {e, f} = ∅,
i.e., the two have no overlap, so this union also cannot be considered for membership in B∪(S). However, consider
{a, b, c, d}, {a, b, c, e}, {b, c, f }.
Here
{a, b, c, d} ∩ {a, b, c, e} ≠ ∅ and
{a, b, c, e} ∩ {b, c, f} ≠ ∅.
Thus the union of the three sets can be considered as an element of the union
closure:
{a, b, c, d, e, f } = {a, b, c, d} ∪ {a, b, c, e} ∪ {b, c, f }. (13.3)
Note that Equations (13.2) and (13.3) give rise to the same set, but only the
second union is acceptable by the definition of our union closure.
Back to boolean closure. The boolean closure is the union of the intersection closure, B∩(S), and the union closure, B∪(S):
B(S) = B∩(S) ∪ B∪(S).
Thus, the boolean closure consists of all the possible intersections and unions of the overlapping sets.
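A brute-force sketch of the closures (practical only for small collections; the function names are assumptions) on the example collection above:

from itertools import combinations

def intersection_closure(S):
    closure = set(S)
    changed = True
    while changed:
        changed = False
        for A, B in combinations(tuple(closure), 2):
            C = A & B                  # nonempty intersections only
            if C and C not in closure:
                closure.add(C)
                changed = True
    return closure

def union_closure(S):
    closure = set(S)
    changed = True
    while changed:
        changed = False
        for A, B in combinations(tuple(closure), 2):
            if A & B:                  # unions of overlapping sets only
                C = A | B
                if C not in closure:
                    closure.add(C)
                    changed = True
    return closure

S = {frozenset('abcd'), frozenset('abce'), frozenset('bcf'), frozenset('ef')}
B = intersection_closure(S) | union_closure(S)   # the boolean closure B(S)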
LEMMA 13.4
(Straddle graph properties lemma) Let S be a collection of sets defined
on alphabet Σ such that no two sets in S straddle.
1. Then the boolean closure, B(S), is the same as S, i.e.,
S = B(S) = B∩ (S) = B∪ (S).
Given a collection of sets S on alphabet Σ, let F(S) denote the set of permutations of Σ in which the members of every set S ∈ S appear consecutively; computing F(S) is the general consecutive arrangement (GCA) problem. Here, for example, let
Σ = {a, b, c, d}
and consider a collection S1 (e.g., S1 = {{a, b, c}, {a, c, d}}). It can be verified that
F(S1) = {b a c d, d c a b, b c a d, d a c b}.
Note that each set S ∈ S1 is such that its members appear consecutively in each s ∈ F(S1). Next, consider a collection S2 for which no such arrangement exists (e.g., S2 = {{a, b}, {b, c}, {a, c}}). Note that
F(S2) = ∅,
i.e., it is not possible to arrange the members of Σ in such a way that each set S ∈ S2 appears consecutively.
Can a systematic approach be developed to solve the GCA problem? The PQ tree is a data structure introduced by Booth and Lueker [BL76] to solve the GCA problem in linear time.
13.5.1 PQ trees
A PQ tree, T , is a directed acyclic graph with the following properties.
1. T has one root (no incoming edges).
2. The leaves (no outgoing edges) of T are labeled bijectively by Σ.
[FIGURE 13.3: Two equivalent PQ trees: (a) T with F(T) = a b c d e f; (b) T′ with F(T′) = f e d b a c.]
Σ = {a, b, c, d, e, f }.
The root node is a P-node and it has six (|Σ|) leaves mapped bijectively to
the elements of Σ. The frontier of a tree T denoted by
F (T ),
is the permutation of Σ obtained by reading the labels of the leaves from left
to right. Note that this definition is valid for any tree, even when the tree is
not a PQ tree.
The frontiers of the PQ trees are shown in Figure 13.3. Two PQ trees T and T′ are equivalent, denoted
T ≡ T′,
if one can be obtained from the other by applying a sequence of the following transformation rules:
1. arbitrarily permute the children of a P-node;
2. reverse the order of the children of a Q-node.
The set of frontiers of T is then defined as
F(T) = {F(T′) | T′ ≡ T}.
In Figure 13.3,
F(T ) = F(T ′ )
= {a b c d e, a b c e d, c b a d e, c b a e d,
d e a b c, d e c b a, e d a b c, e d c b a}.
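A brute-force sketch (quite unlike Booth and Lueker's linear-time machinery) that enumerates F(T) for a small PQ tree; the nested-tuple encoding of the tree is an assumption made for the example. On a tree whose root Q-node spans the Q-groups {a, b, c} and {d, e}, it reproduces exactly the eight frontiers listed above.

from itertools import permutations, product

def frontiers(node):
    # A leaf is a symbol; an internal node is ('P', c1, ...) or ('Q', c1, ...).
    if isinstance(node, str):
        return {node}
    kind, children = node[0], node[1:]
    child_sets = [frontiers(c) for c in children]
    k = len(children)
    # P-nodes permute their children arbitrarily; Q-nodes keep or reverse them.
    orders = (permutations(range(k)) if kind == 'P'
              else [tuple(range(k)), tuple(reversed(range(k)))])
    result = set()
    for order in orders:
        for pick in product(*(child_sets[i] for i in order)):
            result.add(' '.join(pick))
    return result

T = ('Q', ('Q', 'a', 'b', 'c'), ('Q', 'd', 'e'))
print(sorted(frontiers(T)))   # the eight frontiers in F(T) above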
LEMMA 13.5
(Frontier lemma) Let T be a tree whose leaves are labeled bijectively with the elements of Σ and each of whose internal nodes represents the collection, say S ∈ S, of the labels of the leaf nodes reachable from that node. Then the frontier F(T) is a consecutive arrangement for every such S, i.e., F(T) belongs to F(G(S, Er)).
s1 = a b c d e f g h
s2 = a b c d e f g h
THEOREM 13.1
(Set linear arrangement theorem) Given S, a collection of sets on al-
phabet Σ,
F(S) ≠ ∅,
(i.e., there exists a consecutive arrangement of the n members of Σ) if and
only if every straddle graph,
Gstraddle (VS , ES ),
for each S ∈ S, in the reduced partial order graph
G(B(S), Er )
is a chain.
FIGURE 13.4: Since the collection of sets is closed under union of strad-
dling sets, if a node has two parents then it must be part of a pyramid structure
as shown.
FIGURE 13.5: Reduced partial order when the sets can be linearly ar-
ranged.
Figures 13.4 and 13.5(a) show examples of a single pyramid and Figure 13.5(b)
shows possible stacking of three pyramids in a reduced partial order graph.
S=
{ {a, b}, {b, c}, {c, d}, {d, e},
{a, b, c}, {b, c, d}, {c, d, e},
{a, b, c, d}, {b, c, d, e},
{a, b, c, d, e} }.
The reduced partial order graph is shown in Figures 13.6(a) and (b). The exact same sets are denoted by a single Q-node as shown in (c) of the figure. If the number of leaf nodes in the reduced partial order graph is k, then the number of nonleaf nodes is
O(k²).
[Figure panels: (a) the reduced partial order of a b c d e; (b) the reduced partial order of e d c b a; (c) the equivalent Q-node of a PQ tree.]
FIGURE 13.6: Sets shown at the nodes of a reduced partial order graph of
a sequence (a) and its reversal (b). This entire information can be represented
by a single Q-node of a PQ tree as shown in (c).
[Figure 13.7: (a) The reduced partial order graph on Σ = {a, b, c, d, e, f, g}, with an internal node {e, f, g}. (b) The PQ tree.]
Er = E1 ∪ E2 ∪ … ∪ El,
where
Ei ∩ Ej = ∅ for 1 ≤ i < j ≤ l,
and each partition is defined on Ei. By 'tree structure' we mean that no node in the (sub)graph has more than one parent.
Thus we conclude:
Given I, if the elements of X can be organized as a linear string, then the reduced partial order graph can be encoded as a PQ tree.
As the final example, consider the collection whose reduced partial order graph is shown in Figure 13.7(a). It can be partitioned into one 'pyramid structure' (shown enclosed by a dashed rectangle) and a 'tree structure'.
The PQ encoding is as follows: the tree has one Q-node, two P-nodes and seven leaf nodes. The sets shown as (13.5) are encoded as a single P-node, and the sets shown as (13.6) are encoded as a single Q-node.
Consider pairs of the form
S = Ci1 ∩ Ci2 ∩ … ∩ Cip,
IS = {i1, i2, …, ip}.
A pair (S, IS) is nonmaximal if there exists IS ⊊ I′ with the same intersection S, or if there exists a pair (S′, IS′) with
S ⊊ S′ and IS′ = IS.
The output is the set of all maximal pairs (S, IS) such that |S| ≥ K.
Given a set
S = {σi1 < σi2 < … < σil},
define a sequence, seq(S), as
seq(S) = σi1 σi2 … σil.
(a) The concrete example. (b) The trie for S. (c) The reduced trie.
FIGURE 13.8: Continuing the concrete example of Section 13.6: The set
of sequences S, following the ordering a < b < c < d < e < f < g and the
corresponding trie in (b). The reduced trie where each internal node has at
least two children is shown in (c).
The trie, TC,K , of the elements of Sq C,K is termed the ordered enumeration
trie or simply the trie of the input C with quorum K. Figure 13.8 displays
the trie for a simple example. What is the size of this trie?
LEMMA 13.6
(Enumeration-trie size lemma) The number of nodes in the trie TC,K is no more than
Σ_{s ∈ SqC,K} |s| = Σ_{S ∈ SC,K} |S|.
maxSIP(C, K, S, IS, j)
{
  IF j > 0 AND |IS| ≥ K
    ISnew ← {i | i ∈ IS AND σj ∈ (Ci ∈ C)}   //takes O(n) time
    IF (|ISnew| ≥ K) {
      Snew ← S ∪ {σj}
      Terminate ← FALSE
      IF (ISold = Exists(T, ISnew))           //takes O(log n) time
        IF |ISnew| = |IS|                     //immediate parent;
                                              //(Snew, IS) is possibly maximal
          Replace(T, Sold, Snew)              //takes O(log n) time
        ELSE Terminate ← TRUE                 //(Snew, ISnew) is nonmaximal,
                                              //hence terminate this branch
      ELSE Add(T, Snew)                       //takes O(log n) time
      IF NOT Terminate
        maxSIP(C, K, Snew, ISnew, j-1)        //left-child call
    }
    maxSIP(C, K, S, IS, j-1)                  //right-child call
}
Input parameters. The routine in Algorithm (14) takes the following four parameters:
1. A collection of n sets C = {C1, C2, …, Cn}.
2. The quorum K.
3. The pair S and IS computed until this point.
4. The index j of the symbol σj ∈ Σ under consideration.
The algorithm is initiated with the following settings:
S = ∅, IS = {1, 2, …, n}, and j = |Σ|.
A small example is discussed in Figure 13.10. The input sets are shown in (a), where the alphabet is
Σ = {a < c < d}.
A complete binary ordered tree, BC,K, for this input is shown in (b). The root of this tree is labeled with the pair (S, IS) where S = ∅ and IS = {1, 2, …, n}, with n = |C|. Every internal node (including the root node) has exactly two children: the left child is labeled with some σj ∈ Σ, 1 ≤ j ≤ m, and the right child is unlabeled and is shown as a dashed edge in the figure.
For a node v at depth j from the root, define pathv, the path from the root node down to node v on this complete binary ordered tree, by the labels on the edges written as
pathv = X1 X2 … Xj.
The node is labeled by the pair (Sv, ISv), which is defined as
Sv = Σv ∩ Ci1 ∩ Ci2 ∩ … ∩ Cip, and ISv = {i1, i2, …, ip},
where
Σv = X1 ∪ X2 ∪ … ∪ Xj.
LEMMA 13.7
Each dashed edge is such that the labels, (S, IS), on its two end nodes are the same.
C1 = {a, c},
C2 = {a, d},
C3 = {a, c, d}.
(a) The input C = {C1, C2, C3}, with quorum K = 2.
[Figure panel (b): the complete binary ordered tree BC,K, rooted at (∅, {1,2,3}), with edges labeled a, c and d.]
S        IS
{a}      {1, 2, 3}
{a, c}   {1, 3}
{a, d}   {2, 3}
(c) The solution (maximal intersection pairs) to the input.
(Note that the pair ({c}, {1, 3}) is not maximal.)
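The example can be checked with a brute-force sketch (unlike Algorithm (14), this enumerates every subset intersection; all names and the encoding are assumptions):

from itertools import combinations

def maximal_intersection_pairs(C, K):
    n = len(C)
    pairs = {}
    for p in range(1, n + 1):
        for idx in combinations(range(n), p):
            S = frozenset.intersection(*(C[i] for i in idx))
            if not S:
                continue
            # I_S is the full support of S: every input set containing S.
            I_S = frozenset(i + 1 for i in range(n) if S <= C[i])
            # Keep, per index set, only the largest S: the maximal pair.
            if len(I_S) >= K and len(S) > len(pairs.get(I_S, ())):
                pairs[I_S] = S
    return pairs

C = [frozenset('ac'), frozenset('ad'), frozenset('acd')]
for I_S, S in maximal_intersection_pairs(C, 2).items():
    print(sorted(S), sorted(I_S))
# ['a'] [1, 2, 3]; ['a', 'c'] [1, 3]; ['a', 'd'] [2, 3], as in (c) above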
See Figure 13.10(b) for a concrete example. It can be verified that the ordered
enumeration trie, TC,K , is ‘contained’ in this complete binary ordered tree
BC .
We now relate BC to the pseudocode in Algorithm (14). This tree is implic-
itly generated by the algorithm. Note that this is the complete tree, but the
code will prune the tree (as discussed later) for efficiency purposes. In detail,
the algorithm can be understood as follows. The routine makes at most two
recursive calls
1. marked as left-child call, and
2. marked as the right-child call.
Thus each left-child edge of BC is marked with {σj } while the right-child edge
is labeled (implicitly) with an empty set ∅ (or unlabeled in Figure 13.10(b)).
For easy retrieval the pair (Sv , ISv ) at node v is stored in a balanced tree data
structure T (see below).
Pruning BC. Next, we explore how this complete binary ordered tree is pruned. This is done by identifying a set of traversal terminating conditions. Every recursive routine must have a mandatory termination condition for obvious reasons (otherwise, the run-time stack will overflow, eventually crashing the system). For efficiency purposes, the routine has four additional terminating conditions. The termination of a recursive call corresponds to the pruning of the complete binary search tree. All the terminating conditions are discussed below.
1. The ‘mandatory’ terminating condition when all the alphabet has been
explored (the j > 0 condition).
2. We discuss two terminating conditions here, both arising when the com-
puted ISnew is found to be the same as some existing set (ISold or IS ).
This condition ensures that a nonmaximal set S is detected no more
than once, giving an asymptotic improvement in the overall efficiency
(leading to an output-sensitive time complexity) of the algorithm.
(a) Case ISnew = ISold , i.e., the freshly computed ISnew already exists
in the data structure T .
Further if
Sold ⊃ Snew ,
then clearly Snew is nonmaximal. In this case not only can Snew be discarded, but the traversal can also be terminated here, since it is guaranteed that all the subsequent sets detected on this subtree (of the complete binary ordered tree) will be nonmaximal.
However, if Sold ⊉ Snew, then there must exist a set
S′ ⊇ Sold ∪ Snew,
with
IS′ = ISnew = ISold.
Hence it must be that
Sold ⊂ Snew,
and Snew replaces Sold in the data structure T. We use the variable Terminate in the pseudocode to check for this condition.
(b) Case ISnew = IS (at the ‘right-child’ call).
In fact, it is adequate to simply check if the set sizes are the same,
i.e.,
|IS | = |ISnew |,
which can be done in O(1) time.
Why can we safely terminate the search of this branch ('right-child') on the complete binary ordered tree? Since IS = ISnew, the two subtrees corresponding to the left and the right child rooted at this node are identical. Further, all the sets (S) associated with the nodes of the right subtree are nonmaximal. (In other words, Snew is maximal until this point in the execution; the possibility remains that sometime later in the execution it may turn out to be indeed nonmaximal.) See Figure 13.11 for a concrete example of nonmaximal sets in the right subtree.
[Figure: the pruned search tree of Figure 13.10, with the prunings marked I-V: (I) condition (3), |I| = |{3}| < K = 2; (II) condition (4), j = 1 and right-child call; (III) condition (1), j = |Σ| = 3; (IV) condition (2b), IS = ISold; (V) the right and left subtrees are the same.]
FIGURE 13.12: The pruned search tree of Figure 13.10. The order of symbols is a < c < d and quorum K = 2. The different terminating or pruning conditions are explained in the text.
3. If the set size, |ISnew |, falls below quorum, K, the search on the complete
binary ordered tree can be terminated. This gives rise to efficiency in
practice but no asymptotic improvements can be claimed due to this
condition.
4. The very last (when j = 1) ‘right-child’ call need not be made, since it
does not give rise to any new sets S. Again, this gives rise to efficiency
in practice but no asymptotic improvements can be claimed due to this
condition.
For a complete example see Figure 13.12: it shows the pruned version of the
complete binary tree shown in Figure 13.10, along with the various terminat-
ing conditions.
Collapsing the pruned enumeration tree. The dashed edges of the pruned tree can be collapsed, so that each node in this collapsed tree has a unique label (S, IS); the collapsed tree is shown in Figure 13.13(b). We call this the search tree. We study the characteristics of this tree next. The edges of the search tree are of two kinds:
[Figure 13.13: (a) Pruned tree. (b) Collapsed pruned tree, or the search trie.]
(a) Stubs: These are edges that are incident on the little square nodes.
(b) Regular: These are the ones that are not stubs.
The search tree without the stubs is indeed the trie that we discussed earlier
in the section.
The edges in the search tree (both regular and stub) correspond to the
number of recursive calls in Algorithm (14). Using Lemma (13.6), we obtain
the following:
LEMMA 13.8
The number of regular and stub edges in the search tree is no more than
|Σ| Σ_{S ∈ SC,K} |S|.
Consider the maximal pairs
S1 = {a}, IS1 = {1, 2, 3},
S2 = {a, c}, IS2 = {1, 3},
S3 = {a, d}, IS3 = {2, 3}.
Under the ordering a < c < d, S1 = {seq(S1) = a, seq(S2) = ac, seq(S3) = ad}; under c < a < d, S2 = {seq(S1) = a, seq(S2) = ca, seq(S3) = ad}.
[Figure panels: (a) a < c < d and the trie for S1; (b) c < a < d and the trie for S2.]
FIGURE 13.14: The maximal sets shown as solid circles and nonmaximal as hollow circles in the trie (or the search tree with the stubs). Different orderings lead to different tries, but the same maximal pairs.
The pairs (S, IS) are stored in a balanced binary search tree data structure T, keyed by IS, supporting the following operations (each in O(log n) time):
1. Exists(T, IS) checks whether a pair with key IS is already present in T and, if so, returns it.
2. Replace(T, Sold, Snew) replaces the set Sold with Snew in its node.
3. Add(T, (S, IS)) adds (S, IS) as a new node in the data structure T using IS as the key.
Algorithm time complexity. Let the size of the input, NI, and the size of the output, NO, be given as
NI = Σ_{C ∈ C} |C|,
NO = Σ_{S ∈ SC,K} (|S| + |IS|).
Writing NO1 = Σ_{S ∈ SC,K} |S| and NO2 = Σ_{S ∈ SC,K} |IS|, we have
NO = NO1 + NO2.
The number of calls is bounded by the total number of edges in the search tree, given by Lemma (13.6) as O(NO1 |Σ|). Each routine call, which corresponds to a node in this tree, takes O(n) time to compute ISnew and O(log n) time for the balanced-tree operations. Thus the amount of work done on all the nodes of the search tree is
O((NO2 + NI) |Σ|).
Combining this with the time taken for the remainder of the routine, the overall time complexity of Algorithm (14) is
O((NI + NO) |Σ|).
Analogously, consider pairs
S = Ci1 ∩ Ci2 ∩ … ∩ Cip,
IS = {i1, i2, …, ip}.
The pair (S, IS) is minimal if there is no I′ ⊊ IS such that the intersection over I′ is also S. The output is the set of all minimal pairs (S, IS) such that |S| ≥ K.
13.7.1 Algorithm
We design an algorithm along the lines of Algorithm (14). The algorithm (Algorithm (15)) is almost the same except for two differences:
1. There is only one terminating condition in the routine here, which is when the size of the set ISnew falls below the quorum K. In Algorithm (14), there is yet another terminating condition, which is when the set Snew is nonmaximal.
2. Since multiple sets can be associated with one index set I, Sold here is a collection of sets. Further, we use a new routine on the balanced binary tree T called Append(T, Sold, Snew). This routine first removes from the collection Sold any S ∈ Sold satisfying
S ⊃ Snew,
and then appends Snew to Sold.
minSIP(C, K, S, IS, j)
{
  IF (j ≤ 0) EXIT
  ISnew ← {i | i ∈ IS AND σj ∈ (Ci ∈ C)}
  IF (|ISnew| ≥ K) {
    Snew ← S ∪ {σj}
    IF (ISold = Exists(T, ISnew))
      IF |ISnew| = |IS|              //immediate parent, so don't update T
      ELSE Append(T, Sold, Snew)     //multiple minimal sets; if Snew is a
                                     //subset of an existing set S′, S′ is removed from T
    ELSE Add(T, Snew)
    minSIP(C, K, Snew, ISnew, j-1)   //left-child call
  }
  minSIP(C, K, S, IS, j-1)           //right-child call
}
FIGURE 13.15: A search trie with singleton labels on the edges (one node shown carries the pair ({e}, {1,2,3,4,5,6})). The solid nodes represent maximal set pairs. The hollow nodes with a pointing arrow represent minimal set pairs.
Let a node A of the trie be labeled with the pair
(SA, ISA).
The following observation follows directly from the construction of the trie.
LEMMA 13.9
(Trie partial-order lemma) Node B is a descendent of node A in the trie,
if and only if the following two statements hold:
SB ⊃ SA and
ISB ⊆ ISA .
13.8 Multi-Sets
We next consider multi-sets, where the multiplicity of the elements (sometimes also called copy number) is taken into account. In general, a multi-set is given as
S = {σ(k) | σ ∈ Σ, k ≥ 1}.
In other words, each element σ also has a copy number stating the number
of times σ appears in the set. The sets considered in Section 13.6 were such
that k = 1 for each σ. For example, if
S = {a(2), b(4)},
then multi-set S has two copies of a and four copies of b. Let
Σ′ = {σ(c) | σ ∈ Σ and c is a copy number in the data}.
2. Given multi-sets S1, S2, …, Sp,
S1 ∩ S2 ∩ … ∩ Sp = {σ(kmin) | kmin = min_{1≤i≤p}(ki), where σ(ki) ∈ Si}.
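A minimal sketch of this rule using Python's collections.Counter, whose & operator takes exactly the minimum copy number (the two example multi-sets below are hypothetical, not the book's input):

from collections import Counter
from functools import reduce

def multiset_intersection(*multisets):
    # Counter '&' keeps each symbol with the minimum of its copy numbers;
    # a symbol missing from any operand drops out entirely.
    return reduce(lambda A, B: A & B, multisets)

C1 = Counter({'a': 2, 'b': 6})
C2 = Counter({'a': 1, 'b': 2, 'd': 2})
print(multiset_intersection(C1, C2))   # Counter({'b': 2, 'a': 1})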
Let quorum K = 2. What are the maximal multi-sets? Using the problem
specification, the maximal intersection multi-sets are given below.
S1 = {a(1), b(2)} with IS1 = {1, 2, 3},
S2 = {a(1), b(6)} with IS2 = {1, 2},
S3 = {a(2), b(2)} with IS3 = {1, 3}, and
S4 = {a(1), b(2), d(2)} with IS4 = {2, 3}.
We use this as the running example in the remainder of the discussion.
S1 = {a(1), b(2)},
S2 = {a(1), b(6)},
S3 = {a(2), b(2)},
S4 = {a(1), b(2), d(2)}.
S = {seq(S1) = a(1) b(2),
     seq(S2) = a(1) b(6),
     seq(S3) = a(2) b(2),
     seq(S4) = a(1) b(2) d(2)}.
[Figure 13.16: the trie of these four sequences, with edges labeled by the elements and their copy numbers: a(1), a(2), b(2), b(6), d(2).]
Trie of multi-sets. As for the other sets, we first define an ordering on the
elements of Σ. Let
Σ = {σ1 < σ2 < . . . < σm }.
Given a set
S = {σi1 (ci1 ), σi2 (ci2 ), . . . , σil (cil )},
with
σ i 1 < σ i 2 < . . . σi l ,
define a sequence, seq(S), as
seq(S) = σi1(ci1) σi2(ci2) … σil(cil).
In the sequence, each element σi appears exactly once and is annotated with the copy number ci. A trie of the sequences is a tree satisfying the following properties.
1. Each edge of the trie is labeled with an element and its copy number, of the form σ(c).
2. No two siblings in the trie have a label of the form σ(-), i.e., the same symbol σ.
FIGURE 13.17: Let Σ = {a < b < d} where the data shows these
copy numbers: 1, 2, 3 for a; 2, 6 for b; 2 for d. The complete binary ordered
enumeration tree for such an input is shown above. To avoid clutter, the
labels on the nodes are omitted.
The running example is discussed in Figure 13.16. This shows how the
sequence is ‘represented’ by the trie (encoded at the nodes). Each edge is
labeled by the element σ and its copy number. What is the size of the trie?
In fact, it is the same as before and Lemma (13.6) holds even for the multi-sets.
The complete binary ordered tree. We define the complete binary ordered tree as before, with a few additional properties due to the multiplicities. For each σj ∈ Σ, let the jn copy numbers of σj be sorted as
cj1 < cj2 < … < cjn.
1. As before, every internal node (including the root node) has exactly two children. The edges are labeled as follows. The right child (edge) is unlabeled and is shown as a dashed edge in the figure. The left child (edge) is labeled as follows.
(a) The left child of the root node is labeled with σ1(c11).
(b) Let an internal node v have an incoming edge labeled as σj(cjl); then its left child is labeled
σj(cj(l+1)), if (l + 1) ≤ jn,
σ(j+1)(c(j+1)1), otherwise.
See Figure 13.17 for the running example. The tree is pruned using exactly
the same terminating conditions as before and the pruned tree for the running
example is shown in Figure 13.18(a) and the corresponding collapsed tree, or
the search tree, is shown in Figure 13.18(b). Figure 13.19 shows the search
tree for a different ordering of the elements of Σ.
For each symbol, the distinct copy numbers (in increasing order) index into a dictionary list of the input sets; for the running example:

copy numbers:  1   2   3         2   6         2
               ↓   ↓   ↓         ↓   ↓         ↓
Dica:          2 → 1 → 3 ⊣  Dicb: 3 → 1 → 2 ⊣  Dicd: 2 → 3 ⊣

The sublist corresponding to cjr is denoted as Dicσj(cjr). Note that each sublist includes all elements to the right; thus, with an abuse of notation, if Dicσ(c) denotes the set of the elements in the list, then
Dicσ(c) = {i | σ(c′) ∈ Ci with c′ ≥ c}.
//For each σj ∈ Σ, let the jn copy numbers of σj be sorted as:
//  cj1 < cj2 < … < cjn.
maxMIP(C, K, S, IS, j, r)
{
  IF j > 0 AND |IS| ≥ K
    ISnew ← {i | i ∈ IS AND i ∈ Dicσj(cjr)}
    IF (|ISnew| ≥ K) {
      Snew ← S ∪ {σj(cjr)}
      Terminate ← FALSE
      IF (ISold = Exists(T, ISnew))
        IF |ISnew| = |IS|           //immediate parent;
                                    //S is possibly maximal
          Replace(T, Sold, Snew)
        ELSE Terminate ← TRUE       //S is nonmaximal,
                                    //hence terminate this branch
      ELSE Add(T, Snew)
      IF NOT Terminate
        maxMIP(C, K, Snew, ISnew, j′, r′)   //left-child call
    }
    maxMIP(C, K, S, IS, j′, r′)             //right-child call
}
//Here σj′(cj′r′) denotes the next symbol-copy-number pair in the fixed
//ordering: the next copy number of σj if one remains, else the first
//copy number of the next symbol.
[Figure: (a) the pruned tree, rooted at (∅, {1,2,3}), with left edges a(1), a(2), b(2), b(6), d(2) reaching the pairs ({a(1)}, {1,2,3}), ({a(2)}, {1,3}), ({a(1),b(2)}, {1,2,3}), ({a(2),b(2)}, {1,3}), ({a(1),b(6)}, {1,2}) and ({a(1),b(2),d(2)}, {2,3}); (b) the collapsed pruned tree, or the search tree.]
FIGURE 13.18: The complete binary tree of Figure 13.17 has been pruned by using the different terminating conditions in the routine. This tree has been further collapsed to obtain the search tree. The maximal sets are shown as solid circles.
[Figure: the search trie rooted at (∅, {1,2,3}), with edges b(2), b(6), a(1), a(2), d(2) leading to the pairs ({b(2)}, {1,2,3}), ({b(6)}, {1,2}), ({a(1),b(2)}, {1,2,3}), ({a(1),b(6)}, {1,2}), ({a(2),b(2)}, {1,3}) and ({a(1),b(2),d(2)}, {2,3}).]
FIGURE 13.19: The ordered enumeration trie with b < a < d. The nodes with maximal pairs are shown as solid circles.
The routine makes recursive calls (no more than two at each call). At some time point these instances, due to the recursive nature of the process, are partially executed and are waiting for other instances to complete their execution before they can complete their own. For very large data sets, the number of such partially executed routines may pile up, eventually running out of space in the computer memory.
Is the depth-first order of traversal really crucial? There are at least two
implications of this order of traversal.
1. Run time efficiency: The order ensures that a maximal set has been seen
before its nonmaximal versions. This results in an effective pruning of
the trie due to termination conditions (2) of Section 13.6. The only
exception is when the intersection set S is being built, one element σ at
a time.
A related notion is approximate equality of sets: for a threshold 0 < δ ≤ 1,
(S1 ≈ S2) ⇔ |S1 ∩ S2| / |S1 ∪ S2| ≥ δ.
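The measure above is the Jaccard index; a one-line sketch (names are assumptions):

def approx_equal(S1, S2, delta):
    # |S1 ∩ S2| / |S1 ∪ S2| >= delta
    return len(S1 & S2) / len(S1 | S2) >= delta

print(approx_equal({1, 2, 3, 4}, {2, 3, 4, 5}, 0.6))   # 3/5 >= 0.6 -> True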
13.10 Exercises
Exercise 157 (Unique transitive reduction lemma)
1. Show that Er (of Section 13.3) is the smallest set of edges satisfying
C(G(S, Er)) = C(G(S, E)).
2. Show that Er is unique, i.e., if |Er| = |E′| and C(G(S, Er)) = C(G(S, E′)), then E′ = Er.
Exercise 158 (Size of a reduced partial order) Let the reduced partial order graph have n nodes, excluding the terminating nodes.
1. The partial order is called a total order if for any two nodes S1, S2 ∈ S there is a directed path from one to the other. In this case,
|Er| = n + 1,
as shown below. However, without the terminating nodes, the number of edges is n − 1.
[Figure: a chain of n nodes between the terminating nodes.]
2. The partial order is called an empty order if for any two nodes S1, S2 ∈ S, there is no directed path between them. In this case,
|Er| = 2n,
as shown below. However, without the terminating nodes, the number of edges is zero.
[Figure: n nodes, each connected only to the two terminating nodes.]
In the worst case, how many edges can a reduced partial order with n nodes have?
[Figure: two columns of n/2 nodes each.]
Let
Σ = {σ1 , σ2 , . . . , σn/2 }.
Let a node in the first column be S1,i and in the second column be S2,i , then
for i = 1, 2, . . . , n/2,
S1,i = {σi },
S2,i = Σ \ {σi }.
2. Show that any pair of siblings in a reduced partial order graph are in-
comparable.
Hint: (2) Does the statement hold when the partial order is not reduced?
S m = S 1 ∪ S2 ∪ . . . ∪ Sh ,
VS = VS ′ and ES = ES ′ .
1. Then show that the boolean closure, B(S), is the same as S, i.e.,
is acyclic.
Exercise 162 (Frontiers) Enumerate F(T ) for the PQ tree, T , shown be-
low.
a b c d e f g
What is |F(T)| for the given T?
2. If every node with two parents has the mandatory structure, then is
F(S) ≠ ∅?
If some node violates the mandatory structure, is
F(S) = ∅?
Consider the reduced partial order graph of the boolean closure,
G(B(S), Er).
Hint: (1) The forbidden structure shows nodes with more than two parents.
Note that the graph is a reduced partial order of the boolean closure. (2)
Consider the example shown below. Does each node with multiple parents,
respect the mandatory structure? What is F(S)?
[Figure: a node {a, b, c} with the three children a, b and c.]
3. If
|Σ| = O(NI ),
how can the enumeration be changed to exploit this fact?
Consider the dictionary list for a symbol σ, with the copy numbers c1, c2 and c3 pointing into the list:
Dicσ → 3 → 7 → 2 → 4 → 5 → 6 → 8 ⊣
(a) What does the list assume about the ordering of the elements c1, c2, c3?
Condition (2): Consider the nodes of the tree labeled (SA, ISA).
Hint: For a fixed index set I′, how many nodes in the tree have label (S, IS) with I′ = IS? How are these nodes arranged on the tree? Use Lemma (13.9).
Consider the two orderings
a < b < d and b < a < d.
[Figure: the pruned tree of the running example with its prunings marked I-VIII; the nodes carry the pairs (∅, {1,2,3}), ({a(1)}, {1,2,3}), ({a(2)}, {1,3}), ({a(1),b(2)}, {1,2,3}), ({a(2),b(2)}, {1,3}), ({a(1),b(6)}, {1,2}) and ({a(1),b(2),d(2)}, {2,3}).]
Lp′ = Lp .
maxStIP(K, σj, Lp, p)
  IF (j ≤ |Σ|) AND (|Lp| ≥ K)
    Lpsav ← Lp, psav ← p
    FOR EACH i ∈ Lpsav
      IF i ∉ Dicσj THEN Lp ← Lp \ {i}
    p ← p ∪ {σj}
    Quit ← ((|Lp| < K) OR (p ⊂ ExistPat(Lp)))
    IF NOT Quit
      StorePat(Lp, p)                 //new or updated motif
      maxStIP(K, σj+1, Lp, p)         //with σj
    IF |Lp| < |Lpsav|                 //only if the two are distinct
      maxStIP(K, σj+1, Lpsav, psav)   //ignoring σj
Exercise 174 (Tree data structure) For an input C and quorum K, what
is the relationship between the trie
TC,K
and the data structure T to store the pairs (S, IS )?
Hint: Consider the reduced trie where each internal node has at least two children. Can the trie be balanced?
Comments
The material in this chapter is fairly straightforward. At first blush, it even
seems like it does not deserve the dignity of a dedicated chapter. However, I
have seen the very same problem pop up in so many hues and shapes that per-
haps an unobstructed treatment of the material, within its very own chapter,
will do more good than harm.
Chapter 14
Expression & Partial Order Motifs
14.1 Introduction
Consider the task of capturing the commonality across data sequences in a
variety of scenarios. Depending on the data and the domain, the questions
change.
1. Total order: Segments that appear exactly at each occurrence are called string patterns. Certain wild cards may be allowed, or even flexible extensions of the gap regions (such motifs are called extensible motifs).
2. No order (but proximity): If groups of elements appear together, even if they respect no order, these clusters may be of interest. These are called permutation patterns. Again, they may show some substructures of proximity within them (such as PQ structures).
3. Partial order: Is it possible that the key players are only partially ordered? The key players themselves could be as simple as motifs or clusters, or as complex as boolean expressions. Further, if the input is organized as a sequence and this order is important, these can be modeled as the mathematical structure called a partial order. Of course, a more general order is defined as a graph, and topological motifs can be discovered from this organization of data, possibly providing some insight into the process that produced the data.
As the landscape changes, the questions change and so do the answers. There
is an interesting interplay of different ideas such as permutations of motifs or
extensible motifs of permutations or partial orders of expressions and so on.
14.1.1 Motivation
In the following, mini-motifs refer to string motifs. Consider the results
obtained by mining patterns in binding site layouts in four yeast species as
studied in Kellis et al. [KPE+ 03]: S. cerevisiae, S. paradoxus, S. mikatae, and
S. bayanus.
Out of 45,760 mini-motifs, some 2419 significantly conserved mini-motifs are grouped into 72 consensus motifs. For the small fraction of sequences where some motifs occur more than once, only the position closest to the TATA box is utilized. Many of these motifs correspond to known transcription factor binding sites [ZZ99], whereas others are new and putative, supported indirectly by co-expression or functional category enrichment.
We use the number id’s for the motifs and show an example below. Is there
more structure than just the cluster?
37 ∧ 66 ∧ 5
The symbol ∧ denotes ‘and’ and ∨ denotes ‘or’. Notice that motif 37 always
precedes motif 66, but motif 5 could be in any relative position, in each
of the clusters. This is captured in the partial order shown to the right
below. Symbols S and E are meta symbols that denote the left and right ends
respectively.
Spar (YDR034C-A): 37 66 5
Spar (YJL008C):   5 37 66
Spar (YJL007C):   5 37 66
Spar (YMR083W):   37 66 5
Spar (YOR377W):   37 5 66
(1) Input sequence data. (2) Partial order motif: S → 37 → 66 → E, with motif 5 unordered relative to 37 and 66.
A second example, for the cluster
48 ∧ 55 ∧ 37 ∧ 5 ∧ 24,
is shown below. Here motif 48 always precedes motif 55 and motif 37 always precedes motif 24. The common ordering of the elements is captured in the partial order. The genes and motifs in this example are enriched for multiple stress response pathways (e.g., HSPs, Ras) and sensing of extracellular amino acids.
Consider a boolean expression such as
e = m1 ∧ m2 ∨ m3,
which implies that either both m1 and m2 occur, or m3 occurs, in a sequence for which this expression e holds. Alternatively, the same expression can be written as
e = m1 m2 + m3.
Each variable of an expression is a motif m, and an expression is defined over a set of variables V ⊆ F. Π(e) denotes the set of variables appearing in e, and O(e) denotes the set of rows (of an incidence matrix) on which e holds; e.g.,
O(e) = O(e = m1 ∧ m2 ∨ m3)
     = O(m1) ∩ O(m2) ∪ O(m3)
     = O(m1 m2 + m3).
With a slight abuse of notation, the expression itself is manipulated set-theoretically:
e = m1 ∧ m2 ∨ m3 = m1 ∩ m2 ∪ m3 = m1 m2 + m3.
Similarly, the set difference
O(m1) \ O(m2)
is written simply as
m1 \ m2.
Two expressions e1 and e2 defined over V1 and V2 respectively are distinct (denoted as e1 ≠ e2), if one of the following holds:
(i) V1 ≠ V2, or
(ii) V1 = V2 but e1 and e2 evaluate differently on some assignment of the variables.
For a concrete example, consider the incidence matrix
        m1 m2
    1 [ 0  0 ]
I = 2 [ 0  1 ]
    3 [ 1  0 ]
    4 [ 1  1 ]
and the table of distinct expressions over m1 and m2 (√ marks the monotone expressions; m̄ denotes negation):

e                          Π(e)       O(e)
0                          ∅          ∅             √
m̄1 m̄2                      {m1, m2}   {1}
m̄1 m2                      {m1, m2}   {2}
m1 m̄2                      {m1, m2}   {3}
m1 m2                      {m1, m2}   {4}           √
m̄1                         {m1}       {1, 2}
m̄2                         {m2}       {1, 3}
m̄1 m̄2 + m1 m2              {m1, m2}   {1, 4}
m̄1 m2 + m1 m̄2              {m1, m2}   {2, 3}
m2                         {m2}       {2, 4}        √
m1                         {m1}       {3, 4}        √
m̄1 + m̄2 = m̄1 + m1 m̄2       {m1, m2}   {1, 2, 3}
m̄1 + m2 = m2 + m̄1 m̄2       {m1, m2}   {1, 2, 4}
m1 + m̄2 = m̄2 + m1 m2       {m1, m2}   {1, 3, 4}
m1 + m2 = m2 + m1 m̄2       {m1, m2}   {2, 3, 4}     √
1                          {m1, m2}   {1, 2, 3, 4}  √
(FIGURE 14.1: all distinct expressions over m1 and m2 for the incidence matrix I, with Π(e) and O(e); the monotone expressions are marked √.)
Notice that this condition rules out tautologies. For example, using the set
notation, the expressions
m1 ∩ m4
and
m1 \ (m1 \ m4 )
are not distinct.
An expression e is in conjunctive normal form (CNF) if it is a conjunction of clauses, where a clause is a disjunction of literals. For example,
e = (m1 + m2)(m3 + m4 + m5)
is in CNF.
An expression e is in disjunctive normal form (DNF) if it is a disjunction
of clauses, where a clause is a conjunction of literals. For example,
e = m1 m2 + m3 m4 + m5 + m6 + m1 m4
is in DNF form.
It is straightforward to prove that any expression e can be written either
in a CNF or a DNF form and we leave this as an exercise for the reader
(Exercise 176).
A boolean expression is very powerful since it has the capability of expressing very complex interrelationships between the variables (motifs in this case). See Figure 14.1 for an example. However, it is this very same power that renders it ineffective: it can be shown that there always exists a boolean expression that precisely represents any collection of rows in any incidence matrix I. See Exercise 175 and Figure 14.1 for an example.
So we focus on a particular subclass of boolean expression called monotone
expression [Bsh95]. This is a subclass of boolean expressions that uses no
negation. In other words, it uses only conjunctions and disjunctions. See the
marked expressions in the example of Figure 14.1.
However, an expression is called monotone because it displays monotonic behavior (see Exercise 177), thus care needs to be taken to determine if an expression e is monotone. For example, consider
e = m1 m2 + m̄1 m2,
which is written with a negation yet is monotone, since
e = m1 m2 + m̄1 m2 = (m1 + m̄1) m2 = 1 · m2 = m2.
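A brute-force sketch that tests this behavioral notion of monotonicity by enumerating the truth table (see Exercise 177; the function names are assumptions):

from itertools import product

def is_monotone(f, nvars):
    # Monotone: flipping any input from 0 to 1 never decreases the output.
    for bits in product([0, 1], repeat=nvars):
        for i in range(nvars):
            if bits[i] == 0:
                raised = bits[:i] + (1,) + bits[i + 1:]
                if f(*raised) < f(*bits):
                    return False
    return True

# e = m1 m2 + (not m1) m2 simplifies to m2, hence monotone despite the negation.
e = lambda m1, m2: (m1 and m2) or ((not m1) and m2)
print(is_monotone(e, 2))   # True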
The task is to find all monotone expressions e with
|O(e)| ≥ K.
Note that since the expressions are restricted to be monotone, this specification is nontrivial. We say it is nontrivial since it is possible that there exists a collection of rows, V, of I such that there exists no monotone expression e with
O(e) = V.
In other words, the solution to the problem is not simply all K′-sized subsets of the rows where
K′ ≥ K.
A bicluster is given by a collection of rows O and columns V such that, for each j ∈ V, I[i, j] = cj for each i ∈ O and some fixed cj, and there is no
j′ ∉ V with I[i, j′] = cj′,
for each i ∈ O and some fixed cj′. These conditions define the 'constant columns' type of biclusters. See [MO04] for different flavors of biclusters used in the bioinformatics community.
The bicluster is minimal if for each j ∈ V, the collection of rows O and the collection of columns
V \ {j}
do not form a bicluster with the same collection of rows O.
[Figure 14.2: an 8 × 7 incidence matrix I over m1, …, m7 and, alongside, the matrix I′, with the rows of O1 = {1, 3, 5, 7} marked.
Corresponding conjunctive form in I: e1 = m2 m4 m6, O(e1) = O1 = {1, 3, 5, 7}.
Corresponding disjunctive form: e2 = m2 + m4, O(e2) = O2 = {2, 4, 6, 8}.]
LEMMA 14.1
(Flip lemma) Given an (n × m) incidence matrix I, for some 1 ≤ l ≤ m, the disjunction
e = f1 ∨ f2 ∨ … ∨ fl
satisfies O_I(e) = {1, …, n} \ O_Ī(f1 ∧ f2 ∧ … ∧ fl), i.e., the support of a disjunction in I is the complement of the support of the corresponding conjunction in the complemented matrix Ī.
Figure 14.2 shows an example of maximal and minimal biclusters and the corresponding expressions. Note that Ī is defined as
Īij = 1 if Iij = 0, and Īij = 0 if Iij = 1.
The reader is directed to Chapter 13 for algorithms on finding maximal and
minimal biclusters, which is mapped to the problem of finding maximal and
minimal set intersections respectively. We leave the mapping of this con-
struction as Exercise 180 for the reader. Using Lemma (14.1), the mining of
monotone CNF expressions is staged as follows.
1. Find all minimal monotone disjunctions in I, by performing the following two substeps:
(a) Find all minimal conjunctions on Ī.
(b) Extract all minimal monotone disjunctions by negating each of these computed minimal conjunctions stored in T (see Lemma 14.1). For example, if the minimal conjunction is ē = f̄1 ∧ f̄2 (since Ī is used), then the minimal disjunction is e′ = f1 ∨ f2. Let the number of minimal disjunctions computed be d.
2. Copy matrix I to I′. Augment this new matrix I′ with the results of the last step as follows. For each minimal disjunction form e′, introduce a new column c in I′ with
I′[i, c] = 1 if i ∈ O(e′), and 0 otherwise.
The augmented matrix, I ′ , is then of size n × (m + d). Next, find all
monotone conjunctions as maximal biclusters in I ′ .
A concrete example is shown below. To avoid clutter, Ī is not shown. However, ē = m̄1 m̄2 is detected as a minimal (conjunction) bicluster in Ī with support {3, 4}. Thus I′ has the new column m1 + m2 with support {1, 2, 5}. Some solutions on I′ are shown below.
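A small sketch of this staging (the helper names and the toy matrix are assumptions; the bicluster mining itself is delegated to the Chapter 13 machinery):

def complement(I):
    # The matrix I-bar: flip every 0/1 entry.
    return [[1 - x for x in row] for row in I]

def augment(I, disjunction_supports):
    # Append one 0/1 column per minimal disjunction e', set on O(e').
    I_prime = [row[:] for row in I]
    for support in disjunction_supports:      # support = O(e'), as row indices
        for i, row in enumerate(I_prime):
            row.append(1 if i in support else 0)
    return I_prime

I = [[1, 0], [0, 1], [1, 1], [1, 1]]
I_bar = complement(I)
# Suppose the disjunction e' = m1 + m2 was extracted, with O(e') = {0, 1, 2, 3}:
print(augment(I, [{0, 1, 2, 3}]))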
Microarray (real-valued) data is discretized as follows: two values v1 and v2 in column j are treated as indistinguishable if
|v1 − v2| < δj.
If Fj is the resulting collection of distinct symbols for column j, then clearly
nj = |Fj| ≤ n.
Let
Fj = {σj1, σj2, …, σjnj}.
Thus using some m fixed values
δ1 , δ2 , . . . , δm ,
the real matrix M is transformed to a matrix Q with discrete values, i.e.,
Q[i, j] ∈ Fj for each i and j.
Next, we stage the bicluster detection problem (or pattern extraction from
microarrays) as a maximal set intersection problem in the following steps.
1. For each column j and for each symbol σjk, where 1 ≤ k ≤ nj, compute the following sets of rows:
Sjσjk = {i | Q[i, j] = σjk}.
This gives at most Σj nj nonempty sets.
2. Invoke an instance of the maximal set intersection problem (of Section 13.6) with these nonempty sets and quorum K.
3. The solution of Step 2 is mapped back to the solution of the original
(bicluster) problem. This is a straightforward process and we leave the
details as an exercise for the reader (Exercise 180).
A simple concrete example is shown below. Let δj = 0.5, for 1 ≤ j ≤ 3. Only nonsingleton sets are shown in Step 1.

M (rows 1-4, columns g1 g2 g3):        Q:
1:  1.0   3.1   2.85                   1:  a     d   a
2:  2.0   2.5   3.4                    2:  b     c   b, c
3:  1.25  1.9   3.1                    3:  a     b   a, b
4:  1.5   0.7   3.7                    4:  a, b  a   c

Step 1 (the maximal set intersection problem):
S1a = {1, 3, 4},  S1b = {2, 4},  S3a = {1, 3},  S3b = {2, 3},  S3c = {2, 4}.

The two bicluster patterns: S1a ∩ S3a = {1, 3} (rows 1 and 3 on columns g1, g3) and S1b ∩ S3c = {2, 4} (rows 2 and 4 on columns g1, g3).
This method of detecting bicluster patterns has even been applied to protein
folding data in an attempt to understand the folding process at a higher
level [PZ05, ZPKM07].
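A sketch of the discretization step (Step 1); this greedy grouping does not allow a value to receive two symbols, unlike the example above where 1.5 in column g1 is mapped to both a and b, so it is only an approximation of the scheme:

def discretize_column(values, delta):
    # Greedily group values closer than delta to a representative value.
    groups = {}
    for i, v in sorted(enumerate(values), key=lambda p: p[1]):
        for rep in groups:
            if abs(v - rep) < delta:
                groups[rep].add(i)
                break
        else:
            groups[v] = {i}
    return list(groups.values())            # the row sets S_{j, sigma}

M = [[1.0, 3.1, 2.85], [2.0, 2.5, 3.4], [1.25, 1.9, 3.1], [1.5, 0.7, 3.7]]
for j in range(3):
    print(j, [sorted(s) for s in discretize_column([r[j] for r in M], 0.5)])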
F = {m1 , m2 , . . . , mL },
where
L = |F |.
A binary relation B,
B ⊂ F × F,
(a subset of the Cartesian product of F ) is a partial order if it is
1. reflexive,
2. antisymmetric, and
3. transitive.
For any pair m1, m2 ∈ F,
m1 ⪯ m2 if and only if (m1, m2) ∈ B.
In other words, (m1, m2) ∉ B if and only if m1 ⋠ m2.
A string q is compatible with B if for no pair m2 preceding m1 in q, m1 ⪯ m2 holds in B. In other words, the order of the elements in q does not violate the precedence order encoded by B. A compatible q is also called an extension of B. q is a complete extension of B if q contains all the elements of the alphabet F. Such a q is also called a permutation on F. Also,
Prm(F) = {q | q is a permutation on F},
Cex(B) = {q | q is a complete extension of B}.
Prm(F) is the set of all possible permutations on F. Thus
Cex(B) ⊆ Prm(F).
LEMMA 14.2
(Reduced-graph lemma) If G(F, E) is the transitive reduction of a partial order, then for any pair m1, m2 ∈ F, if there is a directed path of length larger than 1 from m1 to m2 in G(F, E), then the edge (m1 → m2) ∉ E.
THEOREM 14.1
(Pair invariant theorem) Let
G(F, E)
be the solution to Problem 27. Then (m1 → m2) ∈ E if and only if, for each qi,
1. m1 precedes m2, and
2. L(m1 → m2) = ∅.
The proof is left as an exercise for the reader (Exercise 184). Note the equivalence of the following sets:
L(m1 → m2) = {m ∈ F | m1 precedes m and m precedes m2, in each qi}
            = {m ∈ F | m is a descendent of m1 and m2 is a descendent of m, in G(F, E)}.
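A direct quadratic sketch of the theorem (not the incremental algorithm described below; the string encoding and names are assumptions). On the input of Figure 14.3 it returns the single edge (1 → 2), to which the meta edges from S and to E would be added:

def reduced_partial_order(perms):
    F = perms[0]
    pos = [{m: i for i, m in enumerate(q)} for q in perms]
    def always_before(x, y):
        return all(p[x] < p[y] for p in pos)
    E = set()
    for m1 in F:
        for m2 in F:
            if m1 != m2 and always_before(m1, m2):
                # condition 2: no m sits between m1 and m2 in every qi
                between = [m for m in F if m not in (m1, m2)
                           and always_before(m1, m) and always_before(m, m2)]
                if not between:
                    E.add((m1, m2))
    return E

print(reduced_partial_order(['1234', '3142', '4132']))   # {('1', '2')}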
The DAG carries two meta vertices S, E ∉ F; the only vertex with no incoming edges is S and the only vertex with no outgoing edges is E. The graph Gi+1(F, E i+1) is constructed from Gi(F, E i) as follows.
q1 = 1 2 3 4
q2 = 3 1 4 2
q3 = 4 1 3 2
(a) Input. (b) G1. (c) G2. (d) G3.
FIGURE 14.3: Incremental construction of G(F, E) = G3.
2. E′′ is the set of new edges that must be added to the DAG due to qi+1, and is defined (constructed) as follows. A pair of characters m1 and m2 are imm-compatible if all of the following hold:
(a) m1 precedes m2 in qi+1, with
L′ = {m | m1 precedes m and m precedes m2 in qi+1}.
(b) m1 is an ancestor of m2 in Gi(F, E i), with
L′′ = L(m1 → m2).
(c) L′ ∩ L′′ = ∅.
Then
E′′ = {(m1 → m2) ∉ E i | m1 and m2 are imm-compatible}.
The proof of correctness of the algorithm is left as an exercise for the reader
(Exercise 185).
What is the size of a reduced partial order graph, in the worst case? See
Exercise 158 of Chapter 13.
The augmented alphabet is
F ∪ F′.
Also,
|F′| = O(|F|).
Consider the input
q1 = a b c d e g f,
q2 = b a c e d f g,
q3 = a c b d e f g,
q4 = e d g f a b c,
q5 = d e g f b a c,
with
F = {a, b, c, d, e, f, g}.
However, notice that certain blocks appear together as follows:
q1 = [a b c] [d e] [g f],
q2 = [b a c] [e d] [f g],
q3 = [a c b] [d e] [f g],
q4 = [e d] [g f] [a b c],
q5 = [d e] [g f] [b a c].
The reduced partial order graph for this input is shown in Figure 14.5(a). In G′, the alphabet is augmented with the meta symbols Si and Ei that flank the clusters.
[Figure panels: (a) Input I: q1 = 1 2 3 4, q2 = 3 1 4 2, q3 = 4 1 3 2. (b) The reduced partial order DAG G. (c) Compatible q's: q4 = 4 1 2 3, q5 = 1 3 4 2, q6 = 1 4 2 3.]
FIGURE 14.4: Incremental construction of G(F, E) = G3.
[Figure 14.5: (a) The reduced partial order graph G on {a, …, g}. (b) The augmented graph G′, in which the clusters {a, b, c}, {d, e} and {f, g} are flanked by the meta symbols S1, E1, S2, E2, S3, E3.]
The boxed elements in Figure 14.5(b) are clusters that always appear to-
gether, i.e., are uninterrupted in I. For example, if
q = a b d e f c g,
then clearly
q ∈ G,
but since the cluster {f, g} is interrupted by c and also the cluster {a, b, c} is
interrupted,
q ∉ G′.
These clusters are flanked by meta symbols, Si on the left and Ei on the
right, in the augmented partial order. Thus this scheme forces the elements
of the cluster to appear together, thus reducing excess.
We leave the details of this scheme as Exercise 188 for the reader.
Figure 14.7 shows an example of a partial order and its inverse. If a partial order B is such that
Cex(B) = Cex(B̄),
then B is called degenerate.
[Figure 14.6: (a) partial order B1; (b) partial order B2; (c) partial order B3; (d)-(f) Cex(B1), Cex(B2) and Cex(B3): all 24 permutations of {1, 2, 3, 4}, with the members of each Cex(·) marked √ and the rest marked ×.]
The proof that this is the only nonempty degenerate partial order is straightforward, and we leave it as an exercise for the reader (Exercise 189). In other words, we can focus on just nondegenerate partial orders.
For a nondegenerate partial order B, what is the relationship between Cex(B) and Cex(B̄)?
[Figure: (a) Partial order B. (b) Partial order B̄.]
FIGURE 14.7: A partial order B and its inverse partial order B̄.
[Figure: (a) Partial order B; (b) partial order B̄; (c) Cex(B); (d) Cex(B̄).]
FIGURE 14.8: A partial order and its inverse is shown in the top row. The next row shows all 24 permutations in Prm(·) for each. The elements of Cex(·) are marked with √ and the rest are marked with ×. Also, each permutation in the boxed array is the inverse of the one to its left in the unboxed array.
[Figure: (a) Partial order B; (b) partial order B̄; (c) Cex(B); (d) Cex(B̄).]
FIGURE 14.9: A partial order and its inverse is shown in the top row. The next row shows all 24 permutations in Prm(·) for each. The elements of Cex(·) are marked with √ and the rest are marked with ×. Also, each permutation in the boxed array is the inverse of the one to its left in the unboxed array. Notice that there are some permutations that belong to neither Cex(B) nor Cex(B̄).
It is instructive to study the two examples shown in Figures 14.8 and 14.9.
We leave the proof of the following lemma as Exercise 190 for the reader.
LEMMA 14.3
Let B be a nondegenerate partial order.
1. If q ∈ Cex(B), then its inverse q̄ ∈ Cex(B̄).
2. The converse of the last statement is not true, i.e., there may exist q ∈ Prm(B) such that q ∉ Cex(B) and q̄ ∉ Cex(B̄) (see Figure 14.9).
4. |Cex(B)| = |Cex(B̄)|, and
|Cex(B)| ≤ |Prm(B)| / 2.
LEMMA 14.4
(Symmetric lemma) For a nondegenerate partial order B, if the DAG of B is isomorphic to the DAG of B̄, then
|Cex(B)| = |Cex(B̄)| = n!/2.
The following is a much simpler scheme to estimate the lower and upper bounds of the size of Cex(B).
Consider the DAG
G(V, E)
of the partial order B. For sequences q1 and q2 with
Π(q1) ∩ Π(q2) = ∅,
let q1 ⊕ q2 denote the set of interleavings of q1 and q2, each preserving the relative order of the elements within q1 and within q2; e.g.,
q = a b c d e
is an interleaving of q1 = a c e and q2 = b d. The vertices of the DAG are placed on a virtual grid, with the column defined as
col(v) = depth(v),
Ci = {v | col(v) = i}.
The depth(v) of each v can be computed in linear time using a breadth first traversal (BFS) of the DAG (see Chapter 2).
[Figure: two grid assignments, (1) and (2).]
FIGURE 14.10: Two possible grid assignments of nodes of a partial order DAG. The C's are the same but the R's differ in the two assignments.
The rows are assigned so that the vertices of a row lie on a common directed path: if vertices v1, v2, …, vl are placed in one row, then
row(v1) = row(v2) = … = row(vl).
A depth first traversal (DFS) of the DAG (see Chapter 2) can be used to compute row(v) for each v satisfying these constraints. Let
Ri = {v | row(v) = i}.
Let the number of nonempty C’s be c and let the number of nonempty R’s
be r. We use the following convention:
ni = |Ri |, for 1 ≤ i ≤ r.
At the end of the process row(v) and col(v) have been computed for each
v. It is possible to obtain different values of row(v) satisfying the condition.
But that does not matter. We are looking for a small number of R sets with
as large a size (|R|) as possible. However, this is only a heuristic to simply put
the vertices on a virtual grid (i, j). See Figure 14.10 for a concrete example.
Let
col(B) = {q = q1 q2 … qc | Π(qi) = Ci, for 1 ≤ i ≤ c}.
The following observation is crucial to the scheme:
If q ∈ col(B), then q ∈ Cex(B).
Note that col(B) does miss a few extensions of B, since the vertices of each column are kept in strict proximity. Thus
col(B) ⊆ Cex(B). (14.1)
Let
qi = v1 v2 … vni, for 1 ≤ i ≤ r,
be the sequence of the vertices of row Ri, and define
row(B) = q1 ⊕ q2 ⊕ … ⊕ qr.
Again, the following observation is crucial to the scheme:
If q ∈ Cex(B), then q ∈ row(B).
Note that each q ∈ Cex(B) must also belong to row(B), since no order of the elements is violated. However, some q ∈ row(B) may violate the order, since there are some edges that go across the R rows, which is not captured by the row(B) definition. Thus
Cex(B) ⊆ row(B). (14.2)
Also, the sizes of col(B) and row(B) are computed exactly as follows (see Exercise 187 for details of the latter computation):
|col(B)| = |C1|! |C2|! … |Cc|!,
|row(B)| = C(n1+n2, n2) · C(n1+n2+n3, n3) ⋯ C(n1+n2+…+nr, nr).
In conclusion,
col(B) ⊆ Cex(B) ⊆ row(B).
For a nondegenerate partial order B,
|col(B)| ≤ |Cex(B)| ≤ min( |row(B)|, |V|!/2 ),
where V is the set of vertices in the DAG of B. Thus the sizes of row(B) and col(B) can be used as coarse lower and upper bounds of |Cex(B)|. It is possible to refine the bounds by trying out different assignments of row(v) (note that for a v, col(v) is unique).
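A small sketch computing the two coarse bounds from the column and row sizes (the function names are assumptions):

from math import comb, factorial

def col_bound(col_sizes):
    # |col(B)| = |C1|! |C2|! ... |Cc|!
    prod = 1
    for c in col_sizes:
        prod *= factorial(c)
    return prod

def row_bound(row_sizes):
    # |row(B)| = C(n1+n2, n2) C(n1+n2+n3, n3) ... C(n1+...+nr, nr)
    total, bound = row_sizes[0], 1
    for n in row_sizes[1:]:
        total += n
        bound *= comb(total, n)
    return bound

print(col_bound([1, 2, 1]), row_bound([2, 2]))   # 2 and 6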
Let q be a permutation chosen uniformly at random from the permutations of
1, 2, …, |F|.
See Section (5.2.3) for a definition of random permutation. Then the probability, pr(B), of the occurrence of the event
q ∈ Cex(B)
is given by
pr(B) = |Cex(B)| / |F|!.
14.5 Redescriptions
We have already developed the vocabulary to appreciate a very interesting idea called redescriptions, introduced by Naren Ramakrishnan and coauthors in [RKM+04]. This can be simply understood as follows. For a given incidence matrix I, if distinct expressions
e1 ≠ e2
have the same support, O(e1) = O(e2), the pair is called a redescription. If, further,
Π(e1) ∩ Π(e2) = ∅,
then the implications are even stronger, since the alternative description or explanation is over a different set of features (or columns).
A redescription is hence a shift-of-vocabulary; the goal of redescription min-
ing is to find segments of the input that afford dual definitions and to find
those definitions. For example, redescription may suggest alternative path-
ways of signal transduction that might target the same set of genes. However,
the underlying premise is that input sequences can indeed be characterized
in at least two ways using some definition (say boolean expression or partial
orders or partial orders of expressions). Consider, for instance, cis-regulatory regions: although cis-regulatory regions are very short (≈ 5-15 base pairs),
they can co-occur in symbiotic, compensatory, or antagonistic combinations.
Hence, characterizing permutation and spacing constraints underlying a fam-
ily of transcription factors can possibly help in understanding how genes are
selectively targeted for expression in a given cell state.
14.7 Summary
14.8 Exercises
Exercise 175 (Boolean expression) Consider the following incidence matrix:
      m1 m2 m3 m4 m5
    1 [ 1  0  1  1  0 ]
I = 2 [ 1  1  0  1  0 ]
    3 [ 0  0  1  0  0 ]
    4 [ 1  1  0  1  0 ]
1. Construct boolean expressions e1, e2 and e3 on the motifs where each satisfies the equation below.
2. Show that for any subset Z of {1, 2, 3, 4}, there exists an expression e over the motifs
m1, m2, …, m5
defined by I such that
O(e) = Z.
Hint: 2. For each row i construct an expression ei, and then e is constructed from this collection of ei's.
(m1 + m2)‾ = m̄1 m̄2,
(m1 m2)‾ = m̄1 + m̄2.
2. Similarly, show that if the value of any variable in Π(e) is changed from 1 to 0, then the value of e only 'decreases', i.e., it either stays the same or changes from 1 to 0.
Exercise 178 (Flip lemma) For a given incidence matrix I and a bicluster S = (S[1], S[2]) of rows S[1] and features S[2], let the complement S̄ be defined as follows:
S̄[1] = {i | i ∉ S[1]},
S̄[2] = {f̄ | f ∈ S[2]}.
If
e1 = m1 m2 … ml
and the pair
S[1] = O(e1) and S[2] = Π(e1)
is a minimal (maximal resp.) bicluster in I, then show that the pair S̄[1], S̄[2] corresponds in Ī to the disjunction
e2 = m̄1 + m̄2 + … + m̄l.
Show that for |F| > 1,
|Prm(F)|
is an even number.
Hint: Note that |Prm(F)| = |F|!. Yet another argument: for |F| > 1, a permutation q ≠ q̄ (its reversal).
Exercise 182 (On transitive reduction) Show that the following two statements are equivalent, where B is a partial order defined on F and G(F, E) is its transitive reduction:
1. m1 ⪯ m2, for m1, m2 ∈ F;
2. there is a directed path (possibly the single edge (m1 → m2) ∈ E) from m1 to m2 in G(F, E).

Exercise 184 (Pair invariant theorem) Prove Theorem (14.1): (m1 → m2) ∈ E if and only if
1. {m ∈ F | m1 precedes m and m precedes m2, in each qi} = ∅, and
2. m1 precedes m2.
1. Identify the edge sets E ′ and E ′′ at each step in the example shown in
Figure 14.3.
[Figure: Partial order B and its inverse partial order B̄, on {1, 2, 3, 4}.]
Exercise 187 (Interleavings) Let q1, q2, …, qr be sequences over pairwise disjoint alphabets, with ni = |Π(qi)|.
1. Let
S2 = {q | q = q1 ⊕ q2}.
Show that
|S2| = C(n1+n2, n1) = C(n1+n2, n2).
2. Let
Sr = {q | q = q1 ⊕ q2 ⊕ … ⊕ qr}.
Show that
|Sr| = C(n1+n2, n2) · C(n1+n2+n3, n3) ⋯ C(n1+n2+…+nr, nr).
[Figure: the augmented alphabet a b c d e f g with the meta symbols S2, S3, E2, E3 flanking the clusters.]
2. How many distinct DAGs are possible? How does the excess reduce with an increase in the number of DAGs? What criterion should be used?
Exercise 189 (Unique degenerate) Show that the only nonempty partial order that is degenerate (Cex(B) = Cex(B̄)) is of the following form:
[Figure: the elements 1, 2, 3, 4 placed, mutually unordered, between S and E.]
Hint: 1. The nodes can be simply relabeled to make the DAGs identical. 2. Follows from 1.
References
[BL76] K. Booth and G. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. Journal of Computer and System Sciences, 13:335-379, 1976.
[Bsh95] N.H. Bshouty. Exact learning boolean functions via the monotone theory. Information and Computation, 123(1):146-153, 1995.
[BT02] Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225-242, 2002.
[SSDZ05] A.D. Smith, P. Sumazin, D. Das, and M.Q. Zhang. Mining ChIP-chip data for transcription factor and cofactor binding sites. Bioinformatics, 21(Suppl 1):403-412, 2005.
R, 417
Ramsey theory, 89
random
  permutation, 134, 346
  string, 135
  variable, 48, 57
    binomial, 62
    exponential, 209
    Poisson, 64
    product, 58
    sum, 58
rat-human data, 308
RC Intervals Extraction Algorithm, 282
real sequences, 167
realization, 164
recombination pattern, 106
redescription, 104, 493
reduced partial order, 420
  size, 458
Reduced-graph Lemma, 481
reducible
  interval, 295
  matrix, 124
redundant, 93, 109, 160, 165, 176
  string pattern, 180
relation
  =δ, 91
  =r, 167
  ≺, 185
  ⪯, 175
  ⊑, 149, 166-168
  antisymmetric, 480
Sample Mean & Variance Theorem, 74
saturation, 151, 185, 187, 188, 258
score
  performance, 245
  solution coverage, 246
scoring matrix
  BLOSUM, 233
  Dayhoff, 233
  PAM, 233
sequence
  motif, 104
  pattern, 102
set
  boolean closure, 423, 460
  contained, 299
  disjoint, 299
  intersection closure, 423
  nested, 299, 334, 335
  overlap, 299
  partial order graph, 419
  partitive, 320
  relation
    contained, 418
    disjoint, 418
    nested, 333, 418
    overlap, 419
    straddle, 418, 460
  union closure, 424
short interspersed nuclear elements, 114
short tandem repeat, 114
union
  closure, 424
  of events, 52
unique pattern, 108
Vandermonde's identity, 63
variable number tandem repeats, 114
Viterbi Algorithm, 130, 132
VNTR, 114
Y-STR, 114