
Chapman & Hall/CRC Mathematical and Computational Biology Series

Pattern Discovery
in Bioinformatics
Theory & Algorithms
CHAPMAN & HALL/CRC
Mathematical and Computational Biology Series

Aims and scope:


This series aims to capture new developments and summarize what is known over the whole
spectrum of mathematical and computational biology and medicine. It seeks to encourage the
integration of mathematical, statistical and computational methods into biology by publishing
a broad range of textbooks, reference works and handbooks. The titles included in the series are
meant to appeal to students, researchers and professionals in the mathematical, statistical and
computational sciences, fundamental biology and bioengineering, as well as interdisciplinary
researchers involved in the field. The inclusion of concrete examples and applications, and
programming techniques and examples, is highly encouraged.

Series Editors
Alison M. Etheridge
Department of Statistics
University of Oxford

Louis J. Gross
Department of Ecology and Evolutionary Biology
University of Tennessee

Suzanne Lenhart
Department of Mathematics
University of Tennessee

Philip K. Maini
Mathematical Institute
University of Oxford

Shoba Ranganathan
Research Institute of Biotechnology
Macquarie University

Hershel M. Safer
Weizmann Institute of Science
Bioinformatics & Bio Computing

Eberhard O. Voit
The Wallace H. Coulter Department of Biomedical Engineering
Georgia Tech and Emory University

Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
24-25 Blades Court
Deodar Road
London SW15 2NU
UK
Published Titles

Cancer Modelling and Simulation


Luigi Preziosi
Computational Biology: A Statistical Mechanics Perspective
Ralf Blossey
Computational Neuroscience: A Comprehensive Approach
Jianfeng Feng
Data Analysis Tools for DNA Microarrays
Sorin Draghici
Differential Equations and Mathematical Biology
D.S. Jones and B.D. Sleeman
Exactly Solvable Models of Biological Invasion
Sergei V. Petrovskii and Bai-Lian Li
Introduction to Bioinformatics
Anna Tramontano
An Introduction to Systems Biology: Design Principles of Biological Circuits
Uri Alon
Knowledge Discovery in Proteomics
Igor Jurisica and Dennis Wigle
Modeling and Simulation of Capsules and Biological Cells
C. Pozrikidis
Niche Modeling: Predictions from Statistical Distributions
David Stockwell
Normal Mode Analysis: Theory and Applications to Biological and
Chemical Systems
Qiang Cui and Ivet Bahar
Pattern Discovery in Bioinformatics: Theory & Algorithms
Laxmi Parida
Stochastic Modelling for Systems Biology
Darren J. Wilkinson
The Ten Most Wanted Solutions in Protein Bioinformatics
Anna Tramontano
Chapman & Hall/CRC Mathematical and Computational Biology Series

Pattern Discovery
in Bioinformatics
Theory & Algorithms

Laxmi Parida

Boca Raton London New York

Chapman & Hall/CRC is an imprint of the


Taylor & Francis Group, an informa business
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487‑2742
© 2008 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Printed in the United States of America on acid‑free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number‑13: 978‑1‑58488‑549‑8 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reprinted
material is quoted with permission, and sources are indicated. A wide variety of references are
listed. Reasonable efforts have been made to publish reliable data and information, but the author
and the publisher cannot assume responsibility for the validity of all materials or for the conse‑
quences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any
electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC)
222 Rosewood Drive, Danvers, MA 01923, 978‑750‑8400. CCC is a not‑for‑profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Parida, Laxmi.
Pattern discovery in bioinformatics / Laxmi Parida.
p. ; cm. ‑‑ (Chapman & Hall/CRC mathematical and computational biology
series)
Includes bibliographical references and index.
ISBN‑13: 978‑1‑58488‑549‑8 (alk. paper)
ISBN‑10: 1‑58488‑549‑1 (alk. paper)
1. Bioinformatics. 2. Pattern recognition systems. I. Title. II. Series: Chapman
and Hall/CRC mathematical & computational biology series.
[DNLM: 1. Computational Biology‑‑methods. 2. Pattern Recognition,
Automated. QU 26.5 P231p 2008]

QH324.2.P373 2008
572.80285‑‑dc22 2007014582

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Dedicated to Ma and Bapa
Contents

1 Introduction 1
1.1 Ubiquity of Patterns . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivations from Biology . . . . . . . . . . . . . . . . . . . 2
1.3 The Need for Rigor . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Who is a Reader of this Book? . . . . . . . . . . . . . . . . 3
1.4.1 About this book . . . . . . . . . . . . . . . . . . . . 4

I The Fundamentals 7
2 Basic Algorithmics 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Tree Problem 1: Minimum Spanning Tree . . . . . . . . . . 14
2.3.1 Prim’s algorithm . . . . . . . . . . . . . . . . . . . 17
2.4 Tree Problem 2: Steiner Tree . . . . . . . . . . . . . . . . . 21
2.5 Tree Problem 3: Minimum Mutation Labeling . . . . . . . 22
2.5.1 Fitch’s algorithm . . . . . . . . . . . . . . . . . . . 23
2.6 Storing & Retrieving Elements . . . . . . . . . . . . . . . . 27
2.7 Asymptotic Functions . . . . . . . . . . . . . . . . . . . . . 30
2.8 Recurrence Equations . . . . . . . . . . . . . . . . . . . . . 32
2.8.1 Counting binary trees . . . . . . . . . . . . . . . . . 34
2.8.2 Enumerating unrooted trees (Prüfer sequence) . . . 36
2.9 NP-Complete Class of Problems . . . . . . . . . . . . . . . 40
2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 Basic Statistics 47
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Basic Probability . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 Probability space foundations . . . . . . . . . . . . 48
3.2.2 Multiple events (Bayes’ theorem) . . . . . . . . . . 50
3.2.3 Inclusion-exclusion principle . . . . . . . . . . . . . 51
3.2.4 Discrete probability space . . . . . . . . . . . . . . 54
3.2.5 Algebra of random variables . . . . . . . . . . . . . 57
3.2.6 Expectations . . . . . . . . . . . . . . . . . . . . . . 58
3.2.7 Discrete probability distribution (binomial, Poisson) 60
3.2.8 Continuous probability distribution (normal) . . . . 64
3.2.9 Continuous probability space (Ω is R) . . . . . . . 66
3.3 The Bare Truth about Inferential Statistics . . . . . . . . . 69
3.3.1 Probability distribution invariants . . . . . . . . . . 70
3.3.2 Samples & summary statistics . . . . . . . . . . . . 72
3.3.3 The central limit theorem . . . . . . . . . . . . . . 77
3.3.4 Statistical significance (p-value) . . . . . . . . . . . 80
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4 What Are Patterns? 89


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Common Thread . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3 Pattern Duality . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.1 Operators on p . . . . . . . . . . . . . . . . . . . . 92
4.4 Irredundant Patterns . . . . . . . . . . . . . . . . . . . . . 92
4.4.1 Special case: maximality . . . . . . . . . . . . . . . 93
4.4.2 Transitivity of redundancy . . . . . . . . . . . . . . 95
4.4.3 Uniqueness property . . . . . . . . . . . . . . . . . 95
4.4.4 Case studies . . . . . . . . . . . . . . . . . . . . . . 96
4.5 Constrained Patterns . . . . . . . . . . . . . . . . . . . . . 99
4.6 When is a Pattern Specification Nontrivial? . . . . . . . . . 99
4.7 Classes of Patterns . . . . . . . . . . . . . . . . . . . . . . . 100
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

II Patterns on Linear Strings 111

5 Modeling the Stream of Life 113


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2 Modeling a Biopolymer . . . . . . . . . . . . . . . . . . . . 113
5.2.1 Repeats in DNA . . . . . . . . . . . . . . . . . . . . 114
5.2.2 Directionality of biopolymers . . . . . . . . . . . . . 115
5.2.3 Modeling a random permutation . . . . . . . . . . . 117
5.2.4 Modeling a random string . . . . . . . . . . . . . . 119
5.3 Bernoulli Scheme . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4 Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4.1 Stationary distribution . . . . . . . . . . . . . . . . 123
5.4.2 Computing probabilities . . . . . . . . . . . . . . . 127
5.5 Hidden Markov Model (HMM) . . . . . . . . . . . . . . . . 128
5.5.1 The decoding problem (Viterbi algorithm) . . . . . 130
5.6 Comparison of the Schemes . . . . . . . . . . . . . . . . . . 133
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6 String Pattern Specifications 139
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3 Solid Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.3.1 Maximality . . . . . . . . . . . . . . . . . . . . . . . 144
6.3.2 Counting the maximal patterns . . . . . . . . . . . 144
6.4 Rigid Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.4.1 Maximal rigid patterns . . . . . . . . . . . . . . . . 150
6.4.2 Enumerating maximal rigid patterns . . . . . . . . 152
6.4.3 Density-constrained patterns . . . . . . . . . . . . . 156
6.4.4 Quorum-constrained patterns . . . . . . . . . . . . 157
6.4.5 Large-|Σ| input . . . . . . . . . . . . . . . . . . . . 158
6.4.6 Irredundant patterns . . . . . . . . . . . . . . . . . 160
6.5 Extensible Patterns . . . . . . . . . . . . . . . . . . . . . . 164
6.5.1 Maximal extensible patterns . . . . . . . . . . . . . 165
6.6 Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.6.1 Homologous sets . . . . . . . . . . . . . . . . . . . . 165
6.6.2 Sequence on reals . . . . . . . . . . . . . . . . . . . 167
6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

7 Algorithms & Pattern Statistics 183


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.2 Discovery Algorithm . . . . . . . . . . . . . . . . . . . . . . 183
7.3 Pattern Statistics . . . . . . . . . . . . . . . . . . . . . . . . 191
7.4 Rigid Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.5 Extensible Patterns . . . . . . . . . . . . . . . . . . . . . . 193
7.5.1 Nondegenerate extensible patterns . . . . . . . . . . 194
7.5.2 Degenerate extensible patterns . . . . . . . . . . . . 196
7.5.3 Correction factor for the dot character . . . . . . . 197
7.6 Measure of Surprise . . . . . . . . . . . . . . . . . . . . . . 198
7.6.1 z-score . . . . . . . . . . . . . . . . . . . . . . . . . 199
7.6.2 χ-square ratio . . . . . . . . . . . . . . . . . . . . . 199
7.6.3 Interplay of combinatorics & statistics . . . . . . . 200
7.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

8 Motif Learning 213


8.1 Introduction: Local Multiple Alignment . . . . . . . . . . . 213
8.2 Probabilistic Model: Motif Profile . . . . . . . . . . . . . . 214
8.3 The Learning Problem . . . . . . . . . . . . . . . . . . . . . 215
8.4 Importance Measure . . . . . . . . . . . . . . . . . . . . . . 216
8.4.1 Statistical significance . . . . . . . . . . . . . . . . . 216
8.4.2 Information content . . . . . . . . . . . . . . . . . . 219
8.5 Algorithms to Learn a Motif Profile . . . . . . . . . . . . . 220
8.6 An Expectation Maximization Framework . . . . . . . . . . 222
8.6.1 The initial estimate ρ0 . . . . . . . . . . . . . . . . 222
8.6.2 Estimating z given ρ . . . . . . . . . . . . . . . . . 223
8.6.3 Estimating ρ given z . . . . . . . . . . . . . . . . . 224
8.7 A Gibbs Sampling Strategy . . . . . . . . . . . . . . . . . . 227
8.7.1 Estimating ρ given an alignment . . . . . . . . . . . 227
8.7.2 Estimating background probabilities given Z . . . . 228
8.7.3 Estimating Z given ρ . . . . . . . . . . . . . . . . . 228
8.8 Interpreting the Motif Profile in Terms of p . . . . . . . . . 229
8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

9 The Subtle Motif 235


9.1 Introduction: Consensus Motif . . . . . . . . . . . . . . . . 235
9.2 Combinatorial Model: Subtle Motif . . . . . . . . . . . . . 236
9.3 Distance between Motifs . . . . . . . . . . . . . . . . . . . . 238
9.4 Statistics of Subtle Motifs . . . . . . . . . . . . . . . . . . . 240
9.5 Performance Score . . . . . . . . . . . . . . . . . . . . . . . 245
9.6 Enumeration Schemes . . . . . . . . . . . . . . . . . . . . . 246
9.6.1 Neighbor enumeration (exact) . . . . . . . . . . . . 246
9.6.2 Submotif enumeration (inexact) . . . . . . . . . . . 249
9.7 A Combinatorial Algorithm . . . . . . . . . . . . . . . . . . 252
9.8 A Probabilistic Algorithm . . . . . . . . . . . . . . . . . . . 255
9.9 A Modular Solution . . . . . . . . . . . . . . . . . . . . . . 257
9.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
9.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

III Patterns on Meta-Data 263


10 Permutation Patterns 265
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . 266
10.2 How Many Permutation Patterns? . . . . . . . . . . . . . . 267
10.3 Maximality . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
10.3.1 P=1 : Linear notation & PQ trees . . . . . . . . . . 269
10.3.2 P>1 : Linear notation? . . . . . . . . . . . . . . . . 271
10.4 Parikh Mapping-based Algorithm . . . . . . . . . . . . . . . 273
10.4.1 Tagging technique . . . . . . . . . . . . . . . . . . . 275
10.4.2 Time complexity analysis . . . . . . . . . . . . . . . 275
10.5 Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.5.1 The naive algorithm . . . . . . . . . . . . . . . . . . 280
10.5.2 The Uno-Yagiura RC algorithm . . . . . . . . . . . 281
10.6 Intervals to PQ Trees . . . . . . . . . . . . . . . . . . . . . 294
10.6.1 Irreducible intervals . . . . . . . . . . . . . . . . . . 295
10.6.2 Encoding intervals as a PQ tree . . . . . . . . . . . 297
10.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 307
10.7.1 Case study I: Human and rat . . . . . . . . . . . . 308
10.7.2 Case study II: E. Coli K-12 and B. Subtilis . . . . . 309
10.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

11 Permutation Pattern Probabilities 323


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 323
11.2 Unstructured Permutations . . . . . . . . . . . . . . . . . . 323
11.2.1 Multinomial coefficients . . . . . . . . . . . . . . . . 325
11.2.2 Patterns with multiplicities . . . . . . . . . . . . . . 328
11.3 Structured Permutations . . . . . . . . . . . . . . . . . . . 329
11.3.1 P -arrangement . . . . . . . . . . . . . . . . . . . . 330
11.3.2 An incremental method . . . . . . . . . . . . . . . . 331
11.3.3 An upper bound on P -arrangements∗∗ . . . . . . . 336
11.3.4 A lower bound on P -arrangements . . . . . . . . . 341
11.3.5 Estimating the number of frontiers . . . . . . . . . 342
11.3.6 Combinatorics to probabilities . . . . . . . . . . . . 345
11.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346

12 Topological Motifs 355


12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 355
12.1.1 Graph notation . . . . . . . . . . . . . . . . . . . . 355
12.2 What Are Topological Motifs? . . . . . . . . . . . . . . . . 356
12.2.1 Combinatorics in topologies . . . . . . . . . . . . . 357
12.2.2 Input with self-isomorphisms . . . . . . . . . . . . . 358
12.3 The Topological Motif . . . . . . . . . . . . . . . . . . . . . 359
12.3.1 Maximality . . . . . . . . . . . . . . . . . . . . . . . 367
12.4 Compact Topological Motifs . . . . . . . . . . . . . . . . . 369
12.4.1 Occurrence-isomorphisms . . . . . . . . . . . . . . . 369
12.4.2 Vertex indistinguishability . . . . . . . . . . . . . . 372
12.4.3 Compact list . . . . . . . . . . . . . . . . . . . . . . 373
12.4.4 Compact vertex, edge & motif . . . . . . . . . . . . 373
12.4.5 Maximal compact lists . . . . . . . . . . . . . . . . 374
12.4.6 Conjugates of compact lists . . . . . . . . . . . . . 374
12.4.7 Characteristics of compact lists . . . . . . . . . . . 378
12.4.8 Maximal operations on compact lists . . . . . . . . 380
12.4.9 Maximal subsets of location lists . . . . . . . . . . . 381
12.4.10 Binary relations on compact lists . . . . . . . . . . 384
12.4.11 Compact motifs from compact lists . . . . . . . . . 384
12.5 The Discovery Method . . . . . . . . . . . . . . . . . . . . . 392
12.5.1 The algorithm . . . . . . . . . . . . . . . . . . . . . 393
12.6 Related Classical Problems . . . . . . . . . . . . . . . . . . 399
12.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 400
12.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
12.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
13 Set-Theoretic Algorithmic Tools 417
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 417
13.2 Some Basic Properties of Finite Sets . . . . . . . . . . . . . 418
13.3 Partial Order Graph G(S, E) of Sets . . . . . . . . . . . . . 419
13.3.1 Reduced partial order graph . . . . . . . . . . . . . 420
13.3.2 Straddle graph . . . . . . . . . . . . . . . . . . . . . 421
13.4 Boolean Closure of Sets . . . . . . . . . . . . . . . . . . . . 423
13.4.1 Intersection closure . . . . . . . . . . . . . . . . . . 423
13.4.2 Union closure . . . . . . . . . . . . . . . . . . . . . 424
13.5 Consecutive (Linear) Arrangement of Set Members . . . . . 426
13.5.1 PQ trees . . . . . . . . . . . . . . . . . . . . . . . . 426
13.5.2 Straddling sets . . . . . . . . . . . . . . . . . . . . . 429
13.6 Maximal Set Intersection Problem (maxSIP) . . . . . . . . 434
13.6.1 Ordered enumeration trie . . . . . . . . . . . . . . . 435
13.6.2 Depth first traversal of the trie . . . . . . . . . . . . 436
13.7 Minimal Set Intersection Problem (minSIP) . . . . . . . . . 447
13.7.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . 447
13.7.2 Minimal from maximal sets . . . . . . . . . . . . . 448
13.8 Multi-Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
13.8.1 Ordered enumeration trie of multi-sets . . . . . . . 451
13.8.2 Enumeration algorithm . . . . . . . . . . . . . . . . 453
13.9 Adapting the Enumeration Scheme . . . . . . . . . . . . . . 455
13.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458

14 Expression & Partial Order Motifs 469


14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 469
14.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . 470
14.2 Extracting (Monotone CNF) Boolean Expressions . . . . . 471
14.2.1 Extracting biclusters . . . . . . . . . . . . . . . . . 475
14.2.2 Extracting patterns in microarrays . . . . . . . . . 478
14.3 Extracting Partial Orders . . . . . . . . . . . . . . . . . . . 480
14.3.1 Partial orders . . . . . . . . . . . . . . . . . . . . . 480
14.3.2 Partial order construction problem . . . . . . . . . 481
14.3.3 Excess in partial orders . . . . . . . . . . . . . . . . 483
14.4 Statistics of Partial Orders . . . . . . . . . . . . . . . . . . 485
14.4.1 Computing Cex(B) . . . . . . . . . . . . . . . . . . 489
14.5 Redescriptions . . . . . . . . . . . . . . . . . . . . . . . . . 493
14.6 Application: Partial Order of Expressions . . . . . . . . . . 494
14.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
14.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496

References 503

Index 515
Acknowledgments

I owe the completion of this book to the patience and understanding of Tuhina
at home and of friends and colleagues outside of home. I am particularly
grateful for Tuhina’s subtle, quiet cheer-leading without which this effort may
have seemed like a thankless chore.
Behind every woman is an army of men. My sincere thanks to Alberto Apos-
tolico, Saugata Basu, Jaume Bertranpetit, Andrea Califano, Matteo Comin,
David Gilbert, Danny Hermelin, Enam Karim, Gadi Landau, Naren Rama-
krishnan, Ajay Royyuru, David Sankoff, Frank Suits, Maciej Trybilo, Steve
Oshry, Samaresh Parida, Rohit Parikh, Mike Waterman and Oren Weimann
for their, sometimes unwitting, complicity in this endeavor.
Chapter 1
Introduction

Le hasard favorise l'esprit préparé. 1


- attributed to Louis Pasteur

1.1 Ubiquity of Patterns


Major scientific discoveries have been made quite by accident; however, a
closer look reveals that the scientist was intrigued by a specific pattern in
the observations. Then some diligent persuasion led to an important discov-
ery. A classic example is that of the English doctor from Gloucestershire,
England, by the name of Edward Jenner. His primary observation was that
milkmaids were immune to smallpox even though other family members would
be infected with the disease. The milkmaids were routinely exposed to cow-
pox and subsequently Jenner’s successful experiment of inducing immunity to
smallpox in a little boy by first infecting him with cowpox led to the world’s
first smallpox vaccination. A sharp observation in 1796 ultimately led to the
eradication of smallpox on this planet in the late 1970s.
A more recent story (1997) that has caught the attention of scientists and
media alike is the story of a group of Nairobi women, immune to AIDS. While
researchers pondered the possibility of these women acquiring immunity from
the environment (like the case of cows for smallpox), a chance conversation of
the attending doctor Dr. Joshua Kimani with the immune patients revealed
that about half of them were close relatives. This sent the doctors scrambling
to look for genetic similarity, leading to the discovery of the presence of ‘killer’ T-
cells in the immune system of these women. This has set researchers on the
path of exploring a vaccine for AIDS.
Stories abound in our scientific history to suggest that these chance obser-
vations are key starting points in the process of major discoveries.

1 Chance favors the prepared mind.


1.2 Motivations from Biology


The biology community is inundated with a large amount of data, such as
the genome sequences of organisms, microarray data, interactions data such
as gene-protein interactions, protein-protein interactions, and so on. This
volume is rapidly increasing and the process of understanding the data is
lagging behind the process of acquiring it. The sheer enormity of this calls for
a systematic approach to understanding using (quantitative) computational
methods. An inevitable first step towards making sense of the data is to study
the regularities and hypothesize that this reveals vital information towards a
greater understanding of the underlying biology that produced this data.
In this compilation we explore various modes of regularities in the data:
string patterns, patterned clusters, permutation patterns, topological pat-
terns, partial order patterns, boolean expression patterns and so on. Each
class captures a different form of regularity in the data enabling us to provide
possible answers to a wide range of questions.

1.3 The Need for Rigor


Unmistakably, the nature of the subject of biology has changed in the last
decades: the transition has been from ‘what’ to ‘how’.
Just as a computer scientist or a mathematician or a physicist needs to have
a fair understanding of biology to pose meaningful questions or provide useful
answers, so does a biologist need to have an understanding of the computa-
tional methods to employ them correctly and provide the correct answers to
the difficult questions.
While the easy availability of search engines makes access to exotic as well
as simple-minded systems very easy, it is unclear that this is always a step
forward. The burden is on the user to understand how the methods work,
what problems they solve and how correct the offered answers are. This book
aims at clarifying some of the mist that may accompany such encounters.
One of the features of the treatment in the book is that in each case, we
attack the problem in a model-less manner. Why is this any good? As ap-
plication domains change the underlying models change. Often our existing
understanding of the domain is so inadequate that coming up with an ap-
propriate model is difficult and sometimes even misleading. So much so that
there is little consensus amongst researchers about the correct model. This has
prompted many to resort to a model-less approach. Often the correct model
can be used to refine the existing system or appropriately pre- or post-process
the data. The model-less approach is not to be misconstrued as neglect of

the domain specifications, but a tacit acknowledgment that each domain de-
serves much closer attention and elaborate treatment that goes well beyond
the scope of this book. Also, this approach compels us to take a hard look at
the problem domain, often giving rise to elegant and clean definitions with a
sound mathematical foundation as well as efficient algorithms.

1.4 Who is a Reader of this Book?


This book is intended for biologists as well as for computer scientists and
mathematicians. In fact it is aimed at anyone who wants to understand the
workings and implications of pattern discovery. More precisely, to appreciate
the contents of this book, it is sufficient to be familiar with the prerequisites
of a regular bioinformatics course.
Often some readers are turned off by the use of terms such as ‘theorem’
or ‘lemma’; however, the book does make use of them. Let me spend a few
words justifying the use of these words and at the same time encouraging
the community to embrace this vocabulary. Loosely speaking, a theorem is a
statement accepted or proposed as a demonstrable truth. Once proven, the
validity of the statement is unquestioned. It provides a means for organizing
one’s thoughts in a reusable manner. This mechanism that the mathematical
sciences have to offer is so compelling that it would be a mistake not to assimilate
it in this collection of logical thoughts.
Clearly, it is easier to have a theorem in mathematics than in the physical
sciences. Most of the theorems in this book are to be simply viewed as concise
factual statements. And, the proof is merely a justification for the claims.
Lemmas, though traditionally used as supporting statements for theorems, are
used here for simpler claims. All the proofs in this book require logical
thinking at the level of a college freshman, albeit a motivated one.
The proofs of the theorems and lemmas are given for a curious and sus-
picious reader. However, no continuity in thought is lost by skipping the
proofs.
If I could replace a theorem with an example, I did. If I could replace an
exposition with an exercise, I did. An illustrative example is worth a thousand
words and an instructive exercise is worth a thousand paragraphs. I have made
heavy use of these two tools throughout the book in an attempt to convey the
underlying ideas.
The body of the chapter and the exercise problems accompanying it have
a spousal relationship: one is incomplete without the other. These problems
are not meant to test the reader but provide supplemental points for thought.
Each exercise is designed carefully and simply requires ‘connecting the dots’
on the part of the reader, while the body of the chapters explains the ‘dots’. The

The challenging exercises (and sections) are marked with ∗∗.

1.4.1 About this book


Consider the scenario of getting a sculptor and a painter together to pro-
duce a work of art. Usually, the sculptor will not paint and the painter will
not sculpt, however the results (if any) of their synergy could be incredibly
spectacular!
Interdisciplinary areas such as bioinformatics must deal with such issues
where each expert uses a different language. Establishing a common vocab-
ulary across disciplines is ideal but has not been very realistic. Even sub-
disciplines such as ‘systems’, ‘computational logic’, ‘artificial intelligence’ and
so on, within the umbrella discipline of computer science, are known to have
developed their very own dialects. Many seasoned researchers may also have
witnessed in their lifetimes the rediscovery of the same theorems in different
contexts or disciplines.
Sometimes, the problem is compounded by the fact that the sculptor dab-
bles in paint and the painter in clay. I believe I am cognizant of the agony
and the ecstasy of cross-disciplinary areas. Yet I write these chapters.

Roadmap of the book. The book is organized in three parts. Part I


provides the prerequisites for the remainder of the book. Chapters 2 and
3 are designed as journeys, with a hypothetical but skeptical biologist, that
take the reader through the corridors of algorithms and statistics. We follow
a story-line, and the ideas from these areas are presented on a need-to-know
basis for a leery reader.
Chapter 4 discusses the connotation of patterns used in this book. The
nuance of repetitiveness that is associated with patterns in this book is not
universal and we reorient the reader to this view through this chapter.
Part II of the book focuses on patterns on linear (string) data. Chap-
ter 5 discusses possible statistical models for such data. Patterns on strings
are conceptually the simplest and the ramifications of these are discussed in
Chapters 6 and 7.
String patterns are simple, yet complex! This is explored in the follow-
ing two chapters. Chapter 8 discusses different (probabilistic) motif learning
methods where the pattern or motif is a consequence of local multiple align-
ment. Chapter 9 focuses on methods, primarily combinatorial, where the
pattern or motif is viewed as the (inexact) consensus of multiple occurrences.
Part III of the book deals with patterns that have more sophisticated spec-
ifications. The complexity is in the characterizations but not necessarily in
the implications. A string pattern on DNA may have as strong a repercussion
as any other.
Permutation patterns are mathematically elegant, algorithmically interest-
ing, statistically challenging and biologically meaningful as well! A well-
rounded topic such as this is always a delight and is discussed in Chapters 10


and 11.
Topological or network motifs are common structures in graph data. This
area of motifs is relatively new and I have met more researchers who are
skeptical about the import of these than any other. In Chapter 12, I give a
slightly unusual treatment of this problem, in the sense that it is not based
on graph traversals, as is usually the case.
Chapter 13 is an attempt at identifying some common tools that could be
utilized in most areas of pattern discovery. These include mainly structures
and operations on finite sets.
The book concludes with a discussion of even more exotic pattern character-
izations in the form of boolean expressions and partial orders in Chapter 14.
Part I

The Fundamentals
Chapter 2
Basic Algorithmics

The worth of an artist is measured


by the sharpness of her tools.
- based on a Chinese proverb

2.1 Introduction
To keep the book self-contained, in this chapter we review the basic mathe-
matics and algorithmics that are required to understand and appreciate the ma-
terial in the remainder of the book. To give a context to some of the abstract
ideas involved, we follow a storyline. This is also an exercise in understanding
an application, abstracting the essence of the task at hand, formulating the
computational problem and designing and analyzing an algorithm for solving
the computational problem. The last (but certainly not the least) part of
the task is to implement the algorithm and analyze its performance in a real
world setting.
Consider the following scenario. Professor Marwin has come across a gem
of a bacterium: a chromo-bacterium that is identified by its color and every
time a mutation occurs, it takes on a new color. Leaving her bacteria culture
with a generous source of nutrition, she proceeds on a two week vacation. On
her return, she is confronted with K distinct colored strains of bacteria. A
closer look reveals that each bacterium has a genome of size m and with only
mutational differences between the genomes. She is intrigued by this and is
eager to reconstruct the evolution of the bacteria in her absence.

2.2 Graphs
A graph is a common abstraction that is used to model binary relationships
between objects. More precisely, a graph G is a pair (V, E), consisting of a
set V and a subset
E ⊂ (V × V )


and we denote the graph by


G(V, E).
The elements of V are called nodes or vertices and those of E are called the
edges of the graph. A graph is finite if its set of vertices (and hence edges)
is finite. In this book all graphs considered will be finite. Further, each edge
can be annotated with labels or numerical quantities. The latter is usually
called the weight of the edge.
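To make the abstraction concrete, the following is a minimal Python sketch (ours, not the book's; the vertex names and the helper wt are illustrative) of one way a finite weighted graph G(V, E) can be represented:

    # A small weighted, undirected graph G(V, E); the vertex names are illustrative.
    V = {"v1", "v2", "v3"}
    E = {
        frozenset({"v1", "v2"}): 2,   # edge (v1 v2) with weight wt(v1 v2) = 2
        frozenset({"v2", "v3"}): 4,   # edge (v2 v3) with weight wt(v2 v3) = 4
    }

    def wt(u, v):
        """Return the weight of edge (u v), or None if (u v) is not in E."""
        return E.get(frozenset({u, v}))

    print(wt("v1", "v2"))   # 2
    print(wt("v1", "v3"))   # None: no edge between v1 and v3

Storing each edge as an unordered pair (a frozenset) reflects the fact that the graphs considered here are undirected.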
Professor Marwin chooses to use graphs to model her bacteria strains and
their interrelationships. Quite naturally, she models the strains as the nodes
in the graph and the edges between the nodes as an evolutionary distance
between the two. Roughly speaking, she is seeking the simplest explanation
of evolution of the bacterial strains.
However, one cannot even begin to seek a method or algorithm before defin-
ing the problem. Of course, problems in biology are usually difficult: they
may not even have a precise mathematical definition. But it is essential to
recognize what can and what can not be done. This goes a long way in fo-
cussing on the correct interpretations and avoiding any misrepresentation of
the results.
A method or an algorithm can only be defined for a precise problem P .
However P may be an approximation of the real biological problem at hand.
So Professor Marwin gets down to the most important task and that is of
recognizing the essence of the computational task. She identifies the following
problem:

Problem 1 (Connected graph with minimum weight) Given distinct


sequences
si , 1 ≤ i ≤ K,
each of size m, let dij be the number of base differences between sequences si
and sj . Consider a weighted graph

G(V, E) with |V | = K

and edge weights


wt(vi vj ) = dij
where
vi , vj ∈ V and (vi vj ) ∈ E.
Weight of G(V, E) is defined as:

WG = Σ_{(vi vj)∈E} wt(vi vj)

The task is to find a connected graph G∗ with minimum weight (over all pos-
sible connected graphs on V ).


FIGURE 2.1: The graph constructed with the data from Example (1). A
vertex vi is labeled with sequence si , 1 ≤ i ≤ 5. The weight associated with
edge (vi vj ) is the distance dij .

The edge weight dij denotes the distance between two genomes si , sj , or the
number of mutations that takes genome si to sj or vice-versa. Clearly, all
the K colored groups of bacteria that Professor Marwin observes did not
evolve independently but from the one strain that she started with. Thus
the connectivity of the graph G, where each node is a genome, represents
how the genomes evolved over a period of time. Given different choices of
graphs that explain the evolution process, the one(s) of most interest would
be the ones that have the smallest weight since that would be considered as
the simplest explanation of the process. Consider an instance of this problem
in the following example.

Example 1 Each genome is a sequence defined on the four nucleotides, Ade-


nine, Cytosine, Guanine and Thymine. Consider Problem (1) with K = 5
and the input genomic sequences as:

s1 = CGACTCGCAT
s2 = CGACCCGCCT
s3 = CCATCCGCAC
s4 = CCATTCCCAT
s5 = CCAGCCCCAT

See Figure 2.1 for the graph corresponding to this data.

Before proceeding further, we pin down a few definitions to avoid ambiguity.


Given a graph
G(V, E) and V ′ ⊆ V,
we call
G(V ′ , E ′ ),

the subgraph induced by V ′ where E ′ is defined as

{(v1 v2 ) ∈ E | v1 , v2 ∈ V ′ }.

Subgraph
G′ (V ′ , E ′ )
is induced by E ′ when given E ′ ⊆ E, V ′ is defined as

{vi ∈ V | (vi , vj ) ∈ E ′ }.

A path in the graph G(V, E) between vertices

v1 , v2 ∈ V

is a sequence of vertices,

(w0 =v1 ), w1 , . . . , (wm =v2 )

such that for each


i, 0 ≤ i < m, (wi wi+1 ) ∈ E.
A graph is connected when there exists a path between any pair of vertices.
A connected component of a graph is an induced subgraph on a maximal set
of vertices V ′ ⊆ V , such that the induced graph G′ (V ′ , E ′ ) is connected.
Next, a careful study of Problem (1) reveals:
1. Each vertex vi ∈ V corresponds to sequence si .
2. dij > 0 for each i ≠ j, since the sequences are distinct.
Determining the structure of G: Consider G(V, E) with

E = φ.

Then
wt(G) = 0
and this is the smallest possible weight of a graph. But the graph is not con-
nected, and hence this is an incorrect solution. However this failure suggests
that the solution needs to introduce as small a number of edges as possible so
as to make the graph connected. Seeking the smallest (or simplest) explana-
tion for a problem is often called the Occam’s razor principle or the principle
of parsimony.
Another way to look at the problem is to start with a completely connected
graph
G(V, E),
i.e., each
(vi vj) ∈ E where i ≠ j,

and remove the redundant edges. Some careful thought leads to the con-
clusion that this graph must have no cycles, i.e., no closed paths. This is
because an edge can be removed from the cycle, maintaining the connectivity
of the graph but reducing the number of edges. These observations lead to
the following definition.

DEFINITION 2.1 (tree, leaf node) A connected graph

G(V, E)

having the property that any two vertices

v1 , v2 ∈ V

have a unique path connecting them is called an acyclic graph or a tree. A


vertex with one incident edge is called a leaf node, all other vertices are called
internal nodes.

Does every tree have internal nodes? Consider a tree with only two vertices:
both vertices have only one incident edge. Hence this tree has no internal
node. Does every tree have a leaf node? The following lemma guarantees that
it does.

LEMMA 2.1
(mandatory leaf lemma) A tree T (V, E) must have a leaf node.

PROOF If
|V | ≤ 2,
clearly all the nodes are leaves. Assume

|V | > 2.

We give a constructive argument for this claim. Let

V = {v1 , v2 , v3 , . . . , vn }.

Consider an arbitrary
vi0 ∈ V.
If vi0 is a leaf node, we are done. Assume it is not. Consider vi1 where

(vi0 vi1 ) ∈ E.

Again if vi1 is not a leaf node, since vi1 has degree at least two, there exists

vi2 ≠ vi0

with
(vi1 vi2 ) ∈ E.
Since the graph is finite and has no cycles 1 this process must terminate with
a leaf node.

We conclude that the structure of the graph we seek is a tree. Of course,


this tree must denote the smallest number of changes, again by the Occam’s
razor principle.

2.3 Tree Problem 1: Minimum Spanning Tree


It is best to start with a naive approach to obtaining the tree with minimum
weight: we simply enumerate all the trees and pick the one(s) with the least
weight. It is easy to see that this algorithm of searching the entire space of
spanning trees will yield the correct solution(s).
The next question to address is: How good is the algorithm? One of the
important criteria is the time the algorithm takes to complete the task. There
are others like the space requirements and so on, but here we focus only on
the running time.
The running time is usually computed as a function of the size of the input
to the problem. Here we have K sequences, each of size m. First, we focus
on computing the dij ’s.
Time to compute the dij ’s: We assume that reading an element takes time
cr , and a comparison operation takes time cc and any arithmetic operation
takes time ca . Each of these is a constant depending on the environment in
which they are carried out. Roughly speaking, a comparison of two sequences
of length m each involves the following operations:

1. initializing a count to 0,

2. reading two elements, one from each of the arrays,

3. comparing the two values and incrementing the counter by 0 or 1.

This takes time


ca + m(2cr + cc + ca ).
The only factor that will change with the instance of the problem is the one
that involves m, the rest being constants. The main concern is the growth

1 Consider the sequence of vertices in the traversal: vi0 vi1 vi2 vi3 . . . vik . If a vertex appears
multiple times in the traversal, then clearly the graph has a cycle.

of this function with m. To avoid clutter and focus on the most meaningful
factor, we use an asymptotic notation (see Section 2.7) and the time is written
as
ca + m(2cr + cc + ca ) = O(m)
We use a big-oh notation O(·) which is explained in Section 2.7. In other
words, all the constants are ignored. The time grows linearly with the size m
and that is all that matters.
Since there are K distinct sequences,

K(K − 1)/2

comparisons are made. Again using the asymptotic notation, ignoring linear
factors (K) in favor of faster growing quadratic factors (K^2), the number of
comparisons is written as O(K^2). Thus, since O(K^2) comparisons are made and
each comparison takes O(m) time, computing the dij ’s takes

O(K^2 m)

time.
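As a concrete illustration of this computation, here is a minimal Python sketch (the function name hamming and the dictionary d are ours) that builds the dij ’s for the K = 5 sequences of Example (1) in O(K^2 m) time:

    from itertools import combinations

    def hamming(s, t):
        """Number of base differences between two equal-length sequences: O(m) time."""
        assert len(s) == len(t)
        return sum(1 for a, b in zip(s, t) if a != b)

    # The five genomes of Example (1).
    seqs = {
        "s1": "CGACTCGCAT",
        "s2": "CGACCCGCCT",
        "s3": "CCATCCGCAC",
        "s4": "CCATTCCCAT",
        "s5": "CCAGCCCCAT",
    }

    # K(K-1)/2 pairwise comparisons, each costing O(m): O(K^2 m) overall.
    d = {(i, j): hamming(seqs[i], seqs[j])
         for i, j in combinations(sorted(seqs), 2)}
    print(d[("s1", "s2")])   # the number of base differences between s1 and s2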

We first assert the following.

LEMMA 2.2
(Edge-vertex lemma) Given a connected graph G(V, E),

(G is a tree) ⇒ |E| = |V | − 1.

PROOF This is easily shown by induction on the number of vertices n.


Consider the base case, n = 2.
Clearly a tree with two vertices has only one edge, thus the result holds for
the base case. Assume the result holds for some n ≥ 2, i.e., a tree with n vertices
has (n − 1) edges.
Consider a connected tree T with (n + 1) vertices. Let T ′ be a tree with n
vertices obtained from T by deleting a leaf node v. Such a vertex (leaf node)
exists by Lemma (2.1). By the induction hypothesis, T ′ has (n − 1) edges. v
has only one edge incident on it, thus T has (n − 1) + 1 edges.

COROLLARY 2.1
Given a tree T (V, E), removing an edge (keeping V untouched) disconnects
the tree.

Back to the algorithm. Since this algorithm involves searching the entire
space of spanning trees, we next count the total number of such trees with K
vertices given as N . This number N is given by Cayley’s formula:

N = K^(K−2).
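For instance, with the K = 5 strains of Example (1), this is already 5^(5−2) = 125 candidate spanning trees to examine.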

However, this is not the end of the story, since we must devise a way to
enumerate these N configurations. This is discussed in Section 2.8.2. To
summarize, the time taken by the naive algorithm is

O(K^K + mK^2),

a function that grows so rapidly with K that this algorithm is unacceptable


for all practical purposes! Thus though the algorithm solves the problem
correctly, its time complexity is too large for it to be of any use.
Recall that the component that finds the tree of the minimum weight in
the naive algorithm is too inefficient to be of practical use. So we next focus
on this problem identified as follows. Recall that the weight of a graph (or a
tree) is the sum of the weights of all the edges on the graph (or tree).

Problem 2 (Minimum Spanning Tree (MST)) Given a weighted graph

G(V, E)

with nonnegative weights on the edges, a spanning tree

T (V ′ , E ′ )

of G(V, E) is such that

1. V ′ = V ,

2. E ′ ⊆ E and

3. T is a tree.

The task is to find a tree


T∗

of minimum weight amongst all possible spanning trees.



2.3.1 Prim’s algorithm


The MST problem is well studied in literature [CLR90] and has a very
elegant solution. We present an algorithm below based on the following ob-
servation: If a set of edges Es disconnects the given graph, then the edge with
the smallest weight in Es must be in the MST. Formally put,

LEMMA 2.3
(Bridge lemma) Given graph G(V, E), let

Es ⊆ E

be such that the graph induced by

(E \ Es )

on G(V, E) is not connected, then an edge satisfying

wt(v1 v2) = min_{(vi vj) ∈ Es} wt(vi vj)

is such that
(v1 v2 ) ∈ (Es ∩ E ∗ )
where
T ∗ (V, E ∗ )
is a minimum spanning tree.

PROOF This is easily shown by contradiction. Let Es be such that it


splits the vertex set into two as

V = V1 ∪ V2

with
V1 ∩ V2 = φ.
Assume the result is not true, i.e.,

(v1 v2) ∉ E∗.

Since T ∗ is connected, it must contain an edge

(v1′ v2′ ) ∈ Es

with
wt(v1′ v2′ ) > wt(v1 v2 ).
Construct T ′ from T ∗ by deleting

(v1′ v2′ ).

Clearly the subgraph T1′ (V1 , E1 ) induced by V1 on T ∗ is connected and acyclic


and so is the subgraph induced by V2 ,

T2′ (V2 , E2 ), where E1 , E2 ⊆ E ∗ .

Next we add the edge (v1 v2 ) to T ′ that now connects the subgraphs T1′ and
T2′ without introducing cycles since v1 ∈ V1 and v2 ∈ V2 , without loss of
generality. As a result T ′ is acyclic and connected, hence a tree. But

wt(T ′ ) < wt(T ∗ ),

which is a contradiction, hence the assumption must be wrong and

(v1 v2 ) ∈ E ∗ .

LEMMA 2.4
(Weakest link lemma) The converse of Lemma (2.3) also holds true. In
other words, given a minimum spanning tree

T ∗ (V, E ∗ )

of a graph G(V, E), consider an edge,

v1 v2 ∈ E ∗ ,

and let the two connected components of T ∗ obtained by deleting the edge
(v1 , v2 ) be
Tk∗ (Vk∗ , Ek∗ ), k = 1, 2.
Gk (V k , E k ) are the two subgraphs of G(V, E) induced by Ek∗ with

v1 ∈ V 1 and v2 ∈ V 2 .

Then
wt(v1 v2) = min_{(vi vj) ∈ E, vi ∈ V^1, vj ∈ V^2} wt(vi vj)

From lemma to algorithm. This lemma and its converse can be used to de-
sign a straightforward algorithm (Algorithm (1)): we progressively construct
an Es in every step as we build E ∗ and V ∗ . This algorithm is also called
Prim’s algorithm [CLR90]. It is now straightforward to prove the correctness
of the algorithm.

LEMMA 2.5
Algorithm (1) correctly computes the minimum spanning tree.

PROOF At each iteration of the algorithm

Es ≠ φ

since
|V ∗ | < |V |.
Thus exactly one edge is added to E ∗ . By Lemma (2.2), T ∗ has

|V | − 1

edges and the algorithm is iterated (|V | − 1) times. Thus T ∗ has (|V | − 1)
edges and by Lemma (2.3), each edge added to E ∗ is in the minimum spanning
tree, hence the algorithm is correct.

Figure 2.2 illustrates the algorithm on the graph of Example (1). Using the
asymptotic notation, Step (0) takes time

O(1).

Step 1 takes
O(|E|)
time, Step (2) takes O(1) and Step 3 takes

O(|E|)

time. Since Steps 1-3 are repeated |V | times, the algorithm takes time

O(|V ||E|).

The running time complexity can be improved by using efficient data struc-
tures for Step (3) of the algorithm. However, for this exposition we stay
content with time complexity of

O(|V ||E|).

Algorithm 1 Minimum Spanning Tree Algorithm


(0) E ∗ ← φ, V ∗ ← φ, Es ← E

FOR i = 1 . . . to (|V | − 1)
(1) Let wt(v1 v2 ) = min_{(vi vj) ∈ Es} wt(vi vj )
(2) V ∗ ← V ∗ ∪ {v1 , v2 }, E ∗ ← E ∗ ∪ {(v1 , v2 )}
(3) Es ← {(vi vj ) | (vi ∈ V ∗ ) AND (vj ∉ V ∗ )}
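For readers who prefer running code to pseudocode, the following is a minimal Python sketch of the same greedy strategy (the names prim_mst and wt are ours; a binary heap plays the role of the efficient data structure alluded to above):

    import heapq

    def prim_mst(vertices, wt):
        """Sketch of Prim's algorithm on a complete weighted graph.
        `vertices` is a list of vertex labels and wt[u][v] the weight of edge (u v);
        the returned list of (u, v, w) triples forms a minimum spanning tree."""
        start = vertices[0]
        in_tree = {start}                 # V*: vertices already spanned
        mst = []                          # E*: chosen edges
        # candidate edges (weight, u, v) with u already inside the tree
        heap = [(wt[start][v], start, v) for v in vertices if v != start]
        heapq.heapify(heap)
        while len(in_tree) < len(vertices):
            w, u, v = heapq.heappop(heap)
            if v in in_tree:
                continue                  # stale entry: v was reached by a lighter edge
            in_tree.add(v)
            mst.append((u, v, w))
            for x in vertices:            # new candidates crossing the cut (V*, V \ V*)
                if x not in in_tree:
                    heapq.heappush(heap, (wt[v][x], v, x))
        return mst

    # Example on a small complete graph with illustrative weights.
    wt3 = {"a": {"b": 2, "c": 4}, "b": {"a": 2, "c": 1}, "c": {"a": 4, "b": 1}}
    print(prim_mst(["a", "b", "c"], wt3))   # [('a', 'b', 2), ('b', 'c', 1)]

At each pop, the heap delivers the lightest edge crossing the current cut, which is exactly the edge that Lemma (2.3) guarantees to be safe.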


FIGURE 2.2: Consider the graph of Figure 2.1. The minimum spanning
tree (MST) algorithm on this graph is shown here. (1)-(4) denote the steps
in the algorithm. At each step the edges shown in bold are the ones that
constitute the spanning tree and the collection of edges Es is shown as dashed
edges. The MST is shown by the bold edges in (4).

2.4 Tree Problem 2: Steiner Tree


We now change the problem scenario of Section 2.1 slightly. After a very
careful set of observations over a period of time, Professor Marwin notices
that once a chromo-bacterium mutates into another color, after a while the
original colored bacterium vanishes from the colonies. In other words, at any
time when K distinct colors are noticed, then there had been some K ′ more
colors that are no longer present. Problem (1) is modified as follows:

Problem 3 (Steiner tree) Given distinct sequences

si , 1 ≤ i ≤ K,

each of size m, let dij be the number of base differences between sequences
si and sj . Consider a weighted graph

G(V, E)

with
|V | ≥ K
and edge weights
wt(vi vj ) = dij where vi , vj ∈ V.
Weight of G(V, E) is defined as:

WG = Σ_{(vi vj)∈E} wt(vi vj).

The task is to find a minimum weight tree

T ∗ (V, E)

with K leaf nodes corresponding to the K input sequences.

This problem requires internal nodes to correspond to new sequences that


need to be constructed by the algorithm, to give a minimum weight tree. We
explain this problem using Example (1) with the solution shown in Figure 2.3.
The given nodes are at the leaves and the solution shows three reconstructed
internal nodes.
This problem is called the Steiner tree problem and is known to be NP-hard.
Section 2.9 gives a brief exposition on problem complexity. The conclusion
is that the problem is hopelessly hard and it is unlikely that there exists an
efficient (polynomial time) algorithm that will guarantee the optimal solution
to this problem.

[Figure 2.3: the Steiner tree on the leaves s1 to s5, with three reconstructed internal nodes labeled CGACCCGCAT, CCATCCGCAT and CCATCCCCAT.]

FIGURE 2.3: The tree of minimum weight with the given vertices as leaf
nodes labeled as s1 to s5 .

Again, we first try a naive approach to solve the problem. As before, we


count all possible trees that have the given K nodes as leaf nodes. This is
already quite difficult so we count

T b(K),

the number of binary or bifurcating trees. In a binary tree, every internal


node has exactly two children and one parent:2 thus every internal node has
degree three. If
T a(K)
is the number of trees with K leaf nodes then

T b(K) ≤ T a(K).

We calculate the number T b(K) in Section 2.8.1.

2.5 Tree Problem 3: Minimum Mutation Labeling


We next identify the following problem. Let each node v be labeled by a
character
σ ∈ Σ,
written as L(v). In the following discussion we assume that L(v) is a set of
characters (instead of a single character) with the connotation that any of

2 In a rooted tree, the only exception is the root which has no parents.

these characters can be assigned as a label to v to obtain an optimal assign-


ment to the tree. Then the weight of the edge (vi vj ) is defined as:

dij = 0 if L(vi) ∩ L(vj) ≠ φ, and dij = 1 otherwise.

Problem 4 (Minimum mutation labeling) Given a tree T (V, E) with K


labeled leaf nodes, the task is to compute a labeling L so as to minimize the
weight of T ,

W(T, L) = Σ_{(vi vj)∈E} dij.


2.5.1 Fitch’s algorithm


We use the following simple observation to design an algorithm for the
problem.3 It states that the optimal solution to the problem can be obtained
from the optimal solution to the subproblems, which in this case are labeled
nonoverlapping subtrees of the tree.

LEMMA 2.6
(Two-tree partition) Let
T (V, E)
be a tree with subtrees

T (V1 , E1 ) and T (V2 , E2 ),

such that
V1 ∩ V2 = φ,
and
V = V1 ∪ V2 ∪ {v0 } where v0 6∈ V1 , V2 .
Further
E = E1 ∪ E2 ∪ {v0 v1 , v0 v2 },
for fixed
vi ∈ Vi , i = 1, 2.
Let the minimal weight of a tree T ′ be given as

Wopt (T ′ ).

3 Strictly
speaking, Fitch’s algorithm was presented for a rooted bifurcating (binary) tree.
We have generalized the principle here.

Then

Wopt(T1) + Wopt(T2) ≤ Wopt(T) ≤ Wopt(T1) + Wopt(T2) + 1.

PROOF We are given that the labeling of T1 and T2 are optimal. Clearly
the following is not possible

Wopt (T ) < Wopt (T1 ) + Wopt (T2 ).

If
L(v1) ∩ L(v2) ≠ φ,
then by the labeling
L(v0 ) ← L(v1 ) ∩ L(v2 ),
we get
Wopt (T ) = Wopt (T1 ) + Wopt (T2 ),
which is clearly optimal, since the weight cannot be improved (reduced)
any further.
If
L(v1 ) ∩ L(v2 ) = φ,
then by the labeling
L(v0 ) ← L(v1 ) ∪ L(v2 ),
we get
Wopt (T ) = Wopt (T1 ) + Wopt (T2 ) + 1.
Again this is clearly optimal, since if it were not, then there exists a labeling
of T1 , say, such that the new weight is less than the given weight, which is a
contradiction.

The following lemma, is a more general form of Lemma (2.6) in the sense
that the number of partitioning subtrees can be larger than 2.

LEMMA 2.7
(Multi-tree partition) Let
T (V, E)
be a tree with subtrees
T (Vi , Ei ), 1 ≤ i ≤ p
such that
Vi ∩ Vj = φ, for 1 ≤ i ≠ j ≤ p,

and

V = (∪i Vi) ∪ {v0}, where v0 ∉ Vi for all i.

Also,

E = (∪i Ei) ∪ (∪i {v0 vi}),

for a fixed vi ∈ Vi . In words, the tree T is the union of the nonoverlapping


subtrees Ti and a vertex v0 that is connected to each of the tree Ti at a vertex
vi ∈ Vi . Let
cnt(σ) = |{vi | 1 ≤ i ≤ p, σ ∈ L(vi )}|.
Then
Σ_{i=1}^{p} Wopt(Ti) ≤ Wopt(T) ≤ (Σ_{i=1}^{p} Wopt(Ti)) + (p − max_{σ∈Σ} cnt(σ)).

The arguments for this lemma are along the lines of that of Lemma (2.6)
and we skip the details here as they give no further insight into the problem
than we already have.
From lemma to algorithm. Do the lemmas give an indication of the algo-
rithm that can solve the problem? Actually, they do. This is a classic case of
obtaining the optimal solution for a problem using optimal solutions to the
subproblems. The task is to break the given tree into subtrees which can be
labeled optimally and then build from there. The starting point is the collec-
tion of trees, the singleton nodes, the leaf nodes, which are optimally assigned
by the given problem.
However, there is one catch. While we solve the subproblems, it is important
to keep track of all the possible labelings of the roots, vi of the subtrees Ti .
The lemmas state that W (T ) is optimal but what is the guarantee that there
is no other labeling of the nodes that gives the optimal solution? For example
a suboptimal labeling of the internal nodes of T1 such that

W (T1 ) = Wopt (T1 ) + 1

could give the optimal labeling for T . Let the suboptimal labeling be denoted
by L′ . Then
L′ (v1 ) ∩ L(v1 ) = φ
and at least one other vertex

v1′ ≠ v1 ∈ V1

is such that
L′ (v1′ ) 6= L(v1′ ).

Next, we claim that a possible alternative labeling of

v1′ ∈ V1

does not matter. This is because T1 is a subtree and the only node that
connects it to the remainder of the tree is v1 , and hence v1′ will never be
considered in the future as well. This implies that the algorithm

1. works for a tree, but not necessarily for a graph, and,

2. gives some optimal labelings but not all optimal labelings of the internal
nodes.

After understanding all the algorithmic implications of Lemmas (2.6), (2.7),


we are now ready to present the algorithm. We first assign a depth to every
node as follows: each leaf node v is assigned Depth(v) ← 0, and each nonleaf
node v is assigned a depth Depth(v) equal to the shortest path length from v
to any leaf node. Let maxdepth be the maximum depth assigned to any vertex
in the tree.

Algorithm 2 Minimum Mutation Labeling

(0-1) FOR EACH leaf node v with label σv ,
          L(v) ← {σv }
(0-2) W t ← 0

FOR d = 1, 2, . . . , maxdepth DO
    FOR EACH v ∈ V with depth(v) = d DO
        U ← {u | (uv) ∈ E, (depth(u) < d)}
        v(σ) ← {u | (u ∈ U ) AND (σ ∈ L(u))}
        L(v) ← {σ ′ | |v(σ ′ )| = max_{σ∈Σ} |v(σ)|}
        W t ← W t + (|U | − max_{σ∈Σ} |v(σ)|)
    ENDFOR
ENDFOR

This algorithm is simple but not necessarily the most efficient. The optimal
weight is computed in the variable W t. The depth of the nodes need not
be explicitly computed and the tree can be traversed bottom-up from the
leaves. For rooted bifurcating trees, this is also known as Fitch’s algorithm.
Figure 2.4 gives an example illustrating this simple and elegant algorithm.


FIGURE 2.4: Labeling internal nodes so as to minimize the weight of


the tree using Fitch’s algorithm. The dashed lines denote the edges under
consideration at that step of the algorithm. The edge is labeled with weight
of 0/1 depending on the labels at the two end vertices. (a) Given tree T with
leaf node labels as shown. (b) The parents of the labeled nodes are assigned
labels. (c)-(d) The last step is continued until all nodes are labeled. For the
given tree, Wopt (T ) = 1.
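
To make the bottom-up computation concrete, the following is a minimal
Python sketch of Algorithm 2 for a rooted tree (processing children before
their parent removes the need to compute depths explicitly); the node names
and the small example tree, which mimics Figure 2.4, are hypothetical.

    from collections import Counter

    def min_mutation_labeling(children, leaf_label, root):
        # children:   dict node -> list of child nodes ([] for leaves)
        # leaf_label: dict leaf node -> its character (e.g. 'A', 'C', 'G', 'T')
        # Returns (label_sets, weight): a candidate label set per node and the
        # number of edges whose end points must differ (the weight of the tree).
        label, weight = {}, 0

        def visit(v):
            nonlocal weight
            if not children[v]:                  # leaf: label is fixed
                label[v] = {leaf_label[v]}
                return
            for u in children[v]:                # label all children first
                visit(u)
            counts = Counter()                   # children admitting each character
            for u in children[v]:
                for s in label[u]:
                    counts[s] += 1
            best = max(counts.values())
            label[v] = {s for s, c in counts.items() if c == best}
            weight += len(children[v]) - best    # children that cannot agree with v

        visit(root)
        return label, weight

    # A hypothetical rooted version of the tree of Figure 2.4: two cherries,
    # one with leaves labeled G, G and one with leaves labeled C, C.
    children = {'r': ['x', 'y'], 'x': ['g1', 'g2'], 'y': ['c1', 'c2'],
                'g1': [], 'g2': [], 'c1': [], 'c2': []}
    leaf_label = {'g1': 'G', 'g2': 'G', 'c1': 'C', 'c2': 'C'}
    labels, w = min_mutation_labeling(children, leaf_label, 'r')
    print(sorted(labels['r']), w)    # ['C', 'G'] 1 -- weight 1, as in Figure 2.4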

2.6 Storing & Retrieving Elements


Encouraged by the success of her quest described in the previous sections,
Professor Marwin decides to store the genomic data of her chromo-bacteria
for future reference. For simplicity, we use only the first five bases of the
genomic sequence. Her problem is formalized as follows.

Problem 5 Let C be a collection of n elements

    ei1 < ei2 < ei3 . . . < ein .

How efficiently can the existence of an arbitrary element ei be checked (or
retrieved) from C?

Suppose the elements are stored in an arbitrary order in the database. Then
a search for ei requires time O(n), which is the time taken to linearly scan
the collection.
Next, let us store the elements in the sorted order shown above in an array A
of size n. We probe the database as follows: split the n elements into two and
test whether ei is in the first or the second collection by simply looking at the
first element e′ of the second collection, i.e., e′ ← A[⌊n/2⌋ + 1].

FIGURE 2.5: Doubly linked linear list: Each element in the linked list has
a pointer to the previous and the next element. The element shown in the
dashed box is a new element that is added in lexicographic order as shown.
This takes O(n) time.


FIGURE 2.6: The elements stored in a balanced binary tree. The element
shown in a dashed box is the element being added to the tree, maintaining its
balanced property. This takes O(log n) time.

If ei ≥ e′ , then the element is possibly in the second collection; otherwise it
is possibly in the first. We repeat this process until either ei is found or we
run out of subcollections to search. The time taken for this process can be
computed using the following recurrence equation:

    T (n) = 1                      if n = 1,
    T (n) = T (n/2) + O(1)         if n > 1.

Asymptotically, T (n) = O(log n). See Section 2.7 and Exercise 7.
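
The probe procedure is just binary search. Here is a short Python sketch;
the five sequences used as sample data are those shown in Figure 2.5.

    def contains(A, x):
        # Binary search on a sorted list A: O(log n) probes.
        lo, hi = 0, len(A) - 1
        while lo <= hi:
            mid = (lo + hi) // 2          # split the current subcollection in two
            if A[mid] == x:
                return True
            if x > A[mid]:
                lo = mid + 1              # continue in the second half
            else:
                hi = mid - 1              # continue in the first half
        return False

    A = sorted(["CCAGCCCCAT", "CCATCCGCAC", "CCATTCCCAT",
                "CGACCCGCCT", "CGACTCGCAT"])
    print(contains(A, "CGACCCGCCT"))      # True
    print(contains(A, "CCATCCGCAT"))      # False (not stored yet)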
Next, Professor Marwin realizes that a new sequence needs to be added very
often. It seems like a waste of effort to redo the whole computation every time.
The problem is modified as follows.

Problem 6 (Element retrieval) Let C be a dynamic collection of objects.


How efficiently can the existence of an arbitrary element ei be checked (or
retrieved) from C?
Since C is a dynamic collection, the elements cannot be stored in an array. The
simplest alternative is to store them in a linked list as shown in Figure 2.5. But
this list cannot be accessed through an index number. It is linearly traversed
using the links each time. A new element is added by deleting an old link
and updating the links of the previous and next elements appropriately. It is
easy to see that it takes time O(n) to insert an element. Can this time be
improved?
Figure 2.6 shows a balanced binary tree. Each node stores an element ep in
the tree, and has a left child that stores el and a right child that stores er ,
where the (lexicographic) order holds:

    el < ep < er .
Balanced tree. Let the height of the subtree rooted at node p be the maxi-
mum number of edges it takes to reach a leaf node. For a node p, let H^p_left
denote the height of the subtree rooted at its left child and H^p_right the
height of the subtree rooted at its right child. A tree is balanced if for every
node p,

    |H^p_right − H^p_left| ≤ 1.

In other words, a difference of at most one is allowed in the height of the left
and the right subtrees of a node. In Figure 2.6, the root r (the vertex at the
top) has H^r_right = H^r_left = 2.

Every other node has a difference of one in the heights of its left and right
child.
Thus when a new element is added, the tree has to remain balanced. The
height of the tree is
O(log n).
In the example in Figure 2.6, the element is added in the correct lexicographic
order and the tree continues to be balanced. In general, it is possible to make
some local adjustments so that the tree is balanced. This balancing can be
done in time
O(log n).
Thus inserting an element in a balanced binary tree takes time

O(log n).

Examples of such data structures are AVL trees (named after the authors
Adelson-Velskii and Landis), 2-3 trees, B-trees, red-black trees and splay-
trees [CLR90].
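
The balance condition itself is easy to state in code. The sketch below checks
it for a small binary search tree whose shape mirrors Figure 2.6(1); the Node
class and the convention that an empty subtree has height −1 are implementation
choices, not part of the text above.

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def height(node):
        # Number of edges on the longest downward path; empty subtree = -1.
        if node is None:
            return -1
        return 1 + max(height(node.left), height(node.right))

    def is_balanced(node):
        # Left and right subtree heights differ by at most 1 at every node.
        if node is None:
            return True
        return (abs(height(node.left) - height(node.right)) <= 1
                and is_balanced(node.left) and is_balanced(node.right))

    root = Node("CCATTCCCAT",
                Node("CCATCCGCAC", Node("CCAGCCCCAT"), None),
                Node("CGACCCGCCT", None, Node("CGACTCGCAT")))
    print(height(root), is_balanced(root))    # 2 True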

2.7 Asymptotic Functions


Often it is easier to represent an entity by a simpler, albeit less accurate,
version that avoids clutter and helps focus on the appropriate features. For
example consider a number

g = 205.42332122132122.

Let an acceptable approximation be f with only two digits after the decimal
and 4
|g − f | ≤ 0.005.
The feature here is apparently the usability with acceptable monetary units.
Denote this ‘approximation’ as Υ, then this can be written as

g = Υ(f ).

Note that f is an approximation to a set of g’s.


We generalize this idea to a nonnegatively valued function g(n). Note that
this is not exactly the same notion. The similarity is only in replacing a
complex function by a simpler one for further scrutiny. Here, our interest is

4 If
g was obtained as the interest computed for thirteen months on a sum of money and
had to be paid to a client, then f = 205.42 is an acceptable approximation.

in studying the growth of this function with n; this is called the asymptotic
behavior of the function g(n).
In the context of algorithm analysis, various approximations are of interest,
some of which are listed in Figure 2.7. We show five different forms of Υ:

(a) O(·) read as ‘big-oh’,

(b) o(·) read as ‘small-oh’,

(c) Ω(·) read as ‘omega’,

(d) ω(·) read as ‘small-omega’ and

(e) Θ(·) read as ‘theta’.


Each has the following characteristics:

1. ignoring constants, since different machines would give a different num-


ber, that can be safely overlooked and

2. studying the function at infinity or very large n.

The ‘big-oh’ function of f (n) denoted as

O(f (n)),

is the most commonly used asymptotic notation and we take a closer look at
its definition. O(f (n)) is the set of functions g(n) such that there exist (∃)
positive constants c and n0 , satisfying

0 ≤ g(n) ≤ cf (n),

for all (∀)


n > n0 .
In other words, for sufficiently large n, there exists some constant c > 0 such
that
cf (n)
is always larger than
g(n).
In practice, to obtain
O(f (n)),
usually the highest order term, ignoring the constants suffices. For example,

    g(n) = 507n² log n + 36n + 2054 = O(n² log n).

    notation        set of functions g(n)                                for n ≥    example
    ------------------------------------------------------------------------------------------
    (a) O(f (n))    ∃ c, n0 :        0 ≤ g(n) ≤ cf (n)                   n0         10^6 n = O(n)
    (b) o(f (n))    ∀ c, ∃ nc :      0 ≤ g(n) ≤ cf (n)                   nc         n = o(n²)
    (c) Ω(f (n))    ∃ c, n0 :        0 ≤ cf (n) ≤ g(n)                   n0         10^-6 n = Ω(n)
    (d) ω(f (n))    ∀ c, ∃ nc :      0 ≤ cf (n) ≤ g(n)                   nc         n² = ω(n)
    (e) Θ(f (n))    ∃ c0 , c1 , n0 : 0 ≤ c0 f (n) ≤ g(n) ≤ c1 f (n)      n0         10^6 n + 10^10 = Θ(n)

FIGURE 2.7: Different asymptotic notations of functions g(n).

Sometimes, simple algebraic manipulations may help to obtain O(f (n)) in the
most appropriate form. For instance,

    g(n) = 507n² log n + 36n log² n + 2054
         = O(n² log n + n log² n)
         = O(n log n (n + log n))
         = O(n² log n).

For more intricate instances, refer to the definitions in Figure 2.7. A picto-
rial representation of some of the functions is shown in Figure 2.8 for conve-
nience.

2.8 Recurrence Equations


A function that is defined in terms of itself is represented by a recurrence
equation. For example, consider the factorial function

F ac(n) = n!

which is defined as the product of natural numbers

1, 2, 3, . . . , n.

    [Three panels: g(n) = O(f (n)), g(n) = Ω(f (n)) and g(n) = Θ(f (n)), each
    plotting g(n) against the bounding function(s) cf (n), or c1 f (n) and c2 f (n),
    for n ≥ n0 .]

FIGURE 2.8: The behavior of the asymptotic functions with respect to
g(n). Note that in each of the cases, for smaller n (n ≤ n0 ), the relative values
of the functions do not matter. However, for large values of n, g(n) must be
consistently larger or smaller than the other functions as shown in the three
cases.

The same function can also be written as:

    Fac(n) = 1                      if n = 1,
    Fac(n) = n · Fac(n − 1)         if n > 1.
The recurrence equation has two parts:
1. the base case (here n = 1) and
2. the recurring case (here n > 1 case).
A useful asymptotic approximation of the factorial function is given by
Stirling’s formula:

    Fac(n) = √(2πn) (n/e)^(n+o(1))                                      (2.1)

In other words, for large n, Fac(n) can be approximated by √(2πn) (n/e)^n .
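
A quick numerical sanity check of Stirling's formula, computing the factorial
from its recurrence and comparing with the approximation (the choice of n
values below is arbitrary):

    import math

    def fac(n):
        # Fac(1) = 1, Fac(n) = n * Fac(n-1) for n > 1.
        return 1 if n == 1 else n * fac(n - 1)

    def stirling(n):
        # sqrt(2*pi*n) * (n/e)^n
        return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

    for n in (5, 10, 20):
        print(n, fac(n), round(stirling(n)))
    # The ratio fac(n)/stirling(n) approaches 1 as n grows.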
Consider the Fibonacci series of numbers which are generated by adding
the last two numbers in the series:

0, 1, 1, 2, 3, 5, 8, 13, 21, . . . .

The nth Fibonacci number given as

F ib(n),

can be written as the following recurrence equation:

    Fib(n) = n                                if n = 0 or n = 1,
    Fib(n) = Fib(n − 1) + Fib(n − 2)          if n > 1.

A recurrence form of this kind serves as a concise way of defining the function.
But a closed form is more useful for algebraic manipulations and comparison
with other forms. The closed form expression for the Fibonacci number is:

    Fib(n) = (1/√5) [((1 + √5)/2)^n − ((1 − √5)/2)^n].

It can also be easily seen that Fib(n) = O(c^n ), where c = (1 + √5)/2.
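
The closed form can be checked directly against the recurrence, for example:

    import math

    def fib(n):
        # Iterate Fib(0) = 0, Fib(1) = 1, Fib(n) = Fib(n-1) + Fib(n-2).
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    def fib_closed(n):
        s5 = math.sqrt(5)
        return (((1 + s5) / 2) ** n - ((1 - s5) / 2) ** n) / s5

    for n in range(10):
        assert fib(n) == round(fib_closed(n))
    print([fib(n) for n in range(10)])    # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]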

2.8.1 Counting binary trees


We illustrate the use of setting up recurrence equation and then obtaining
a closed form solution, if possible, on a tree.

Problem 7 (Counting binary trees) Let the leaf nodes of a binary tree be
labeled from
1, 2, . . . K.
How many such trees can be constructed?

Let
T b(K)
denote the number of binary trees with K leaf nodes. We first compute

E(K),

the number of edges in a binary tree with K leaf nodes. Clearly,

E(2) = 1.

Given a binary tree with K leaf nodes, if a new leaf node is to be added to
the tree, the leaf node with an edge has to be always attached to the middle
of an existing edge. Thus the existing edge is lost and three new edges are
added, effectively increasing the number of edges by 2. Thus, for K > 2,

E(K + 1) = E(K) + 2.

A closed form solution to this recurrent equation can be obtained by using a


method called telescoping. We illustrate this method below.

The recurrence equation is:

    E(K) = 1                     if K = 2,
    E(K) = E(K − 1) + 2          if K > 2.

The closed form of this is computed as follows. For K > 2,

    E(K) = E(K − 1) + 2
         = E(K − 2) + 2 + 2
         = E(K − 3) + 2 + 2 + 2
           ...
         = E(K − (K − 2)) + 2(K − 2)
         = E(2) + 2(K − 2)
         = 1 + 2(K − 2)
         = 2K − 3.
Now it should be clear why this approach is called telescoping. We now
construct the next set of recurrence equations. Let T b(K) denote the number
of unrooted binary trees with K leaf nodes. Consider K = 2. Clearly, there
is only one tree, thus T b(2) = 1. Next, we define T b(K) in terms of
T b(K − 1). A new leaf node can be attached to any edge in a tree with
(K − 1) leaves and this tree has E(K − 1) edges. Thus for K > 2,

    T b(K) = E(K − 1) T b(K − 1)
           = (2K − 5) T b(K − 1).

The recurrence equation is:

    T b(K) = 1                             if K = 2,
    T b(K) = (2K − 5) T b(K − 1)           if K > 2.

The closed form is as follows:

    T b(K) = (2K − 4)! / (2^(K−2) (K − 2)!)                             (2.2)

See Exercise 8 at the end of the chapter for f (K), where T b(K) = O(f (K)).
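
A short check that the telescoped product and the closed form of Equation (2.2)
agree:

    import math

    def tb_rec(K):
        # Tb(2) = 1, Tb(K) = (2K - 5) * Tb(K - 1) for K > 2.
        t = 1
        for k in range(3, K + 1):
            t *= 2 * k - 5
        return t

    def tb_closed(K):
        # (2K - 4)! / (2^(K-2) * (K - 2)!)
        return math.factorial(2 * K - 4) // (2 ** (K - 2) * math.factorial(K - 2))

    print([tb_rec(K) for K in range(2, 8)])       # [1, 1, 3, 15, 105, 945]
    print([tb_closed(K) for K in range(2, 8)])    # the same values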

2.8.2 Enumerating unrooted trees (Prüfer sequence)


Consider the problem of enumerating trees on a collection of nodes.

Problem 8 (Enumerating unrooted trees) Given K nodes labeled from,

1, 2, . . . , K,

enumerate all (unrooted) trees on the labeled nodes.

Such a tree is a spanning tree on the complete graph on K vertices. While


enumerating trees, it is important to do so in a manner that avoids repetition.
A moment of reflection will show that this is not as trivial a task as it seems
at first glance.
The trees are unrooted, i.e., there is no particular node in the tree that is
designated to be a root. Hence the process must identify identical (isomorphic)
trees.
It turns out that there is a very elegant way of solving this problem using
Prüfer sequences [Prü18]. This sequence is associated with a tree whose nodes
are labeled from
1, 2, . . . , K,
and is generated iteratively as follows:
Given a tree with labeled nodes, at step i,

1. pick the leaf node li with the smallest label,

2. output li ’s immediate neighbor pi , and

3. remove li from tree.

This process is continued until only two vertices remain on the tree. The
sequence of labels
p1 p2 . . . pK−3 pK−2
is called the Prüfer sequence. Notice that repetitions are allowed in the se-
quence, i.e., it is possible that for some

1 ≤ i, j ≤ K − 2,

the labels are such that


pi = pj .
Figure 2.9 shows two examples of a Prüfer sequence and its corresponding
labeled tree.
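
The generation step translates almost verbatim into code. The Python sketch
below computes the Prüfer sequence of the first tree of Figure 2.9 (which, by
the bijection described here, has node 1 adjacent to 2, 3, 5 and 4, and node 4
adjacent to 6); the adjacency-dictionary representation is an implementation
choice.

    def prufer_sequence(adj):
        # adj: dict mapping each node label to the set of its neighbors.
        # The dictionary is consumed as leaves are removed.
        seq = []
        while len(adj) > 2:
            leaf = min(v for v in adj if len(adj[v]) == 1)   # smallest-labeled leaf
            parent = next(iter(adj[leaf]))                   # its unique neighbor
            seq.append(parent)
            adj[parent].remove(leaf)
            del adj[leaf]
        return seq

    adj = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1, 6}, 5: {1}, 6: {4}}
    print(prufer_sequence(adj))    # [1, 1, 1, 4], i.e., p = 1114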
FIGURE 2.9: Examples of trees and the corresponding Prüfer sequences (the
two trees shown have p = 1114 and p = 2346, respectively).

Given a Prüfer sequence p, the construction of the tree can be done in a
single scan of the sequence from left to right. The algorithm first constructs
L, the set of nodes missing from p. At each step of the scan we introduce an
edge between the minimum element of L and the currently scanned element
of the sequence. L is updated by removing the minimum element and adding
the scanned element, provided it does not occur again in the remainder of
the sequence. This process is described in Algorithm (3) and illustrated in
Figure 2.10. In the algorithm description, Π(s) denotes the set of all numbers
occurring in a sequence s.

Algorithm 3 Constructing trees from Prüfer Sequences


(0-1) l ← |p|
(0-2) V ← {1, 2, . . . , (l + 2)}
(0-3) L ← V \ Π(p)

FOR i = 1, 2, . . . , l DO
(1) m ← min L
(2) Introduce edge (m, p[i]) in E
(3) L ← L \ {m}
(4) If p[i] ∉ Π(p[i + 1 . . . l]) L ← L ∪ {p[i]}
ENDFOR
(5) Introduce edge (m1 , m2 ) in E where m1 , m2 ∈ L
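
A direct Python transcription of Algorithm (3), checked on the first sequence
of Figure 2.9:

    def tree_from_prufer(p):
        # Reconstruct the edge set of the labeled tree with Prüfer sequence p.
        l = len(p)
        L = set(range(1, l + 3)) - set(p)     # labels missing from p (line (0-3))
        E = []
        for i in range(l):
            m = min(L)
            E.append((m, p[i]))               # line (2)
            L.remove(m)                       # line (3)
            if p[i] not in p[i + 1:]:         # line (4)
                L.add(p[i])
        m1, m2 = sorted(L)                    # line (5): exactly two labels remain
        E.append((m1, m2))
        return E

    print(tree_from_prufer([1, 1, 1, 4]))
    # [(2, 1), (3, 1), (5, 1), (1, 4), (4, 6)], the first tree of Figure 2.9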

Proof of correctness of algorithm (3). We first show that the graph


constructed by Algorithm (3) is connected. At the start of the algorithm,
there are K connected components corresponding to each vertex. Assign a
distinct color to each component. Every time a vertex from component i is
connected to a vertex from component j(> i), all the vertices in component
j are assigned the color of component i. The very first time line (2) of the
iteration is executed, assume the first connected component of the graph is
under construction with the color of p[1] as color 1.
Next it is easy to see that at the end of the i-th iteration of the algorithm,
the vertex p[i] (which belongs to the connected component containing the
last edge added), must occur again as incident on a subsequently added edge
(either as p[j] with j > i or as min(L)). Every vertex gets added to L exactly
once, either at line (0-3) or at line (4) and at each iteration a vertex gets
removed from L (line (3)). Hence at the end of l = K − 2 iterations at line
(5), |L| = 2 and there can be at most two connected components remaining

corresponding to the two vertices in L and these get connected in the last
step. Thus the constructed graph is connected.
Further, an edge is constructed in each of the l = K − 2 iterations and one
more edge is added at line (5), hence the graph has K − 1 edges. A connected
graph with K − 1 edges on K nodes is a tree (see Exercise 2). This proves
that the algorithm is correct.
The theorem below follows directly from Algorithm (3) and its proof of
correctness.

THEOREM 2.1
(Prüfer’s theorem)5 There is a bijection from the set of Prüfer sequences
of length
K −2
on
V = {1, 2, . . . , K}
to the set of trees T with vertex set V . In other words, a Prüfer sequence
corresponds to a unique tree and vice-versa.

COROLLARY 2.2
The number of trees on K nodes is (Cayley’s number): K^(K−2) .

PROOF Since there is a one-to-one correspondence between sequences of


length
K −2
on
{1, 2, . . . , K}
and trees with K nodes, the number of such trees is the same as the number
of distinct sequences of length K − 2, which is K^(K−2) .

Back to enumerating all trees. Now, it is easy to see that to enumerate
all trees on K nodes, we need to enumerate all Prüfer sequences of length
K − 2; a unique labeled tree can then be constructed from each sequence in
linear time, O(K), using Algorithm (3).

5 Given sets X, Y , a function f : X → Y is injective or one-to-one if (f (x) = f (y)) =⇒
(x = y) and is surjective or onto if for every y ∈ Y , there is an x ∈ X with f (x) = y. A
function f that is both injective and surjective is called a bijection or a bijective function.

FIGURE 2.10: Illustration of Algorithm (3) on the Prüfer sequence p =
111484. (a) The tree has |p| + 2 = 8 nodes. (b)-(f) Each iteration of the
algorithm: the boxed element in p is the one being scanned and the minimum
label of L is boxed; consequently the bold edge in the graph is the one that is
constructed at this step. (g) The very last edge is on the two vertices of L.
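
Putting the pieces together, all labeled trees on K nodes can be enumerated
by running the reconstruction sketch above (tree_from_prufer) over every
sequence of length K − 2; the count agrees with Cayley's number:

    from itertools import product

    def all_trees(K):
        # One labeled tree per Prüfer sequence of length K - 2 over {1, ..., K}.
        for p in product(range(1, K + 1), repeat=K - 2):
            yield tree_from_prufer(list(p))

    print(sum(1 for _ in all_trees(5)))    # 125 = 5^(5-2)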

2.9 NP-Complete Class of Problems


We saw earlier in the chapter that the minimum spanning tree problem
(Problem (2)) has a polynomial time solution. We were unable to devise
such an algorithm for a very similar problem, the Steiner tree problem (Prob-
lem (3)). Usually, a problem that has a polynomial time algorithm,

O(nc ),

is called tractable, where n is the size of the input and c is some constant. So,
is the Steiner tree problem really not tractable or were we not smart enough
to find one?
Theoretical computer scientists study an interesting class of problems, called
the NP-complete 6 problems, whose tractability status is unknown. No poly-
nomial time algorithm has been discovered for any problem in this class, to
date. However, the general belief is that the problems in this class are in-
tractable. Needless to mention, this is the most perplexing open problem in
the area.
Notwithstanding the fact that the central problem in theoretical computer
science remains unresolved, techniques have been devised to ascertain the
tractability of a problem using relationships between problems in this class.
Suppose we encounter Problem X. First we need to check if this problem
has been studied before. A growing compendium of problems in the class
of NP-complete problems exist and it is very likely that a new problem one
encounters is identical to one of this collection. For example, our Problem (3)
was identified to be the Steiner tree problem.
If Problem X cannot be identified with a known problem, then the next step
is to reduce a problem, Problem A, from the NP-complete class in polynomial
time to Problem X. The reduction is a precise process that demonstrates
that a solution to Problem A can be obtained from a solution to Problem X
in polynomial time. This proves that Problem X also belongs to the class of
NP-complete problems.

6 NP stands for ‘Non-deterministic Polynomial’ and any further discussion requires a fair

understanding of formal language theory and is beyond the scope of this exposition.

Once it is established that Problem X is in the class of NP-complete prob-


lems, it is futile to seek a polynomial time algorithm.7 The next course of
action may be to either redefine the problem depending on the application,
or design an approximation algorithm. Loosely speaking, an approximation
algorithm guarantees the computed solution to be within an ǫ factor of the
optimal.

Summary
The chapter introduces the reader to a very basic abstraction, trees. The
link between pattern discovery and trees will become obvious in the later
chapters. To understand and appreciate the issues involved with trees, we
elaborate on three problems: (1) Minimum spanning tree, (2) Steiner tree and
(3) Minimum mutation labeling. The first and the third have a polynomial
time solution. The reader is also given a quick introduction to using the same
abstraction as a data structure (balanced binary trees). Recurrence equation
is a simple yet powerful tool that helps in counting and Prüfer sequences are
invaluable in enumerating trees.
This was a brief introduction to the exciting world of algorithmics. In my
mind, this is the foundation for a systematic subject such as bioinformatics.
The beauty of this field is that some very powerful statements (that will be
used repeatedly elsewhere in the book), such as the number of internal nodes
in a tree is bounded by the number of leaf nodes, are consequences of very
simple ideas (see Exercise 4). The intent of the chapter has been to introduce
the reader to basic concepts used elsewhere in the book as well as influence
his or her thought processes while dealing with computational approaches to
biological problems.

2.10 Exercises
Exercise 1 (mtDNA) DNA in the mitochondria (mtDNA) of a cell traces
the lineage of a mother to child. A health center that gathers mtDNA data for
families to help trace and understand hereditary diseases accidentally mixes
up the mtDNA data, losing all lineage information for a family with seven
generations. The entire sequence information of the mtDNA for each member
is accessible from a database.

7 Though in theory it is possible since the central tractability question is still unresolved, in
practice, it is extremely unlikely.

1. Assuming the only change in the mtDNA of a daughter is at most one


mutation at a locus of the mother’s mtDNA, can the lineage information
be recovered?

2. What are the issues to be kept in mind in designing a recovery system


for this scenario?

Exercise 2 (Tree edge) Show the converse of Lemma (2.2). In other


words, prove that given a connected graph G(V, E),

(G is a tree) ⇐ (|E| = |V | − 1)

Hint: Use induction on |V |, the number of vertices.

Exercise 3 (Tree leaf nodes) Show that a tree T (V, E) with |V | > 1 must
have at least l = 2 leaf nodes.
Does the statement hold for l = 3 (assume |V | > 2)?

Hint: Fix a vertex vf ∈ V . Define a notion of distance from vf to any

(v 6= vf ) ∈ V.

Note that v with the largest distance from vf must be a leaf node.

Exercise 4 (Linear size of trees) Given a tree T (V, E), let l be the number
of leaf nodes. Show that
|V | ≤ 2l.

Exercise 5 In the optimally (and completely) labeled Steiner tree, a node v


is always optimally assigned with respect to its immediate neighbors.
Prove or disprove this statement.
Hint: Consider the labeled tree shown in the figure for this exercise, whose
internal nodes are labeled C and G. The leaf node assignments, which cannot
be relabeled, are shown enclosed in boxes. Notice that the above statement
holds true for this labeling of the tree. The cost of this tree is 3, corresponding
to the three dashed edges. Can the cost be improved with a different labeling
of the internal nodes?

Exercise 6 Show that Equation (2.2) is the solution to the recurrence equation:

    F (K) = 1                             if K = 2,
    F (K) = (2K − 5) F (K − 1)            if K > 2.

Hint: Use the telescoping technique to obtain

F (K) = 1 × 3 × 5 × . . . × (2K − 5),

for K > 2. Next, use induction on K.

Exercise 7 Show that asymptotically

T (n) = O(log n),

where

    T (n) = 1                       if n = 1,
    T (n) = T (n/2) + O(1)          if n > 1.

Exercise 8 Show that asymptotically

    T b(k) = (O(k))^k ,

where

    T b(k) = (2k − 4)! / (2^(k−2) (k − 2)!).

Hint: Use Stirling’s approximation (Equation (2.1)) for the factorials in T b(k).

Exercise 9 ∗∗ (1) Design an algorithm to add an element to a balanced tree.


(2) Design an algorithm to delete an element from a balanced tree.

Hint: An efficient algorithm can be designed by maintaining extra informa-


tion at each node. See a standard text on algorithms and data structures such
as [CLR90].

Exercise 10 (MaxMin path problem) Let G(V, E) be a weighted con-


nected graph with

wt(v1 v2 ) ≥ 0 for each (v1 v2 ) ∈ E.

A path, P (v1 , v2 ), from v1 to v2 is given as:

P (v1 , v2 ) = (v1 =vi1 ) vi2 . . . vik−1 (vik =v2 ),

for some k. The path length is defined as follows:

    Len(P (v1 , v2 )) = Len((v1 =vi1 ) vi2 . . . vik−1 (vik =v2 ))
                      = min_{j=2...k} wt(vij−1 vij ).

Given two vertices v1 , v2 ∈ V , the distance between the two is given as the
maximum path length over all possible paths between v1 and v2 , i.e.,

    Dis(v1 , v2 ) = max_{P (v1 ,v2 )} Len(P (v1 , v2 )).

1. Design an algorithm to compute

Dis(u, v) for each v ∈ V,

where u ∈ V is some fixed vertex.


2. What is the time complexity of the algorithm?
Hint: 1. Is the following algorithm correct? Why?
1. For each v ∈ V , dis[v] ← 0
dis[u] ← ∞
2. Let S ← V
3. WHILE (S 6= φ)
(a) Let x ∈ S be such that dis[x] = max_{v∈S} (dis[v]).

(b)
S ← S \ {x}

(c) For each (v ∈ S AND (xv) ∈ E)


i. dis[v] ← max{dis[v], min(dis[x], wt(xv))}
ii. Keep track of the path here, by storing backpointer to x, if the
second term contributed to the max value

4. For each v, dis[v] is the required distance.

2. This works in time O(|V |² |E|). Is it possible to improve the time
complexity to O(|V | log |V | |E|)?

Exercise 11 (Expected depth on an execution tree) Now we address


the problem of computing the average path depth (number of vertices in the
path from the root, or, depth from the root) in the MaxMin Problem (Exer-
cise 10), when each edge weight wt(v1 , v2 ) is distributed uniformly over (0,1).
Let T be the execution tree generated by the algorithm: this has k nodes.

1. What is the total number of such trees?

2. What is the average distance of a node from the root?

Hint: Let M (d, k), d ≤ k, be the total number of vertices at depth d in ALL
rooted trees with k nodes. Then

    M (0, 1) = 1,
    M (1, 1) = 1,
    M (d, k) = M (d − 1, k − 1) + kM (d, k − 1)       for 0 ≤ d ≤ k,
    M (d, k) = 0                                      otherwise.

1. M (0, k), k ≥ 1, is the total number of all possible trees (not necessarily
distinct):
M (0, k) = k!
2. Average distance, µ(k), of a node from the root on the tree with k nodes:
Pk
d=1dM (d, k)
µ(k) =
kM (0, k)
Pk
d=1 dM (d, k)
= .
k × k!

Constructing the functions M (d, k), µ(k):

    k\d   -1    0     1     2     3     4     5    6    7    8    µ(k)
     1     0    1     1     0                                     1
     2     0    2     3     1     0                               1.25
     3     0    6    11     6     1     0                         1.44
     4     0   24    50    35    10     1     0                   1.60
     5     0   96   274   225    85    15     1    0              2.17
     6     0    -     -     -   735   175    21    1    0         -
     7     0    -     -     -     -     -   322   13    1    0    -
Chapter 3
Basic Statistics

Statistics is like your cholesterol health,


it is not just what you do, but more importantly,
what you do not do, that matters.
- anonymous

3.1 Introduction
To keep the book self-contained, in this chapter we review some basic statis-
tics. To navigate this (apparent) complex maze of ideas and terminologies,
we follow a simple story-line.
Professor Marwin, who kept us busy in the last chapter with her algorithmic
challenges, also tends a Koi 1 pond with four types of this exquisite fish: Asagi
(A), Chagoi (C), Ginrin Showa (G) and Tancho Sanshoku (T). The fish have
multiplied to the point where the professor can no longer keep track of their
exact number, but she claims that the four types have an equal representation
in her large pond.
She is introduced to a scanner (a kind of fish net or trap) that catches
no more than one at each trial i.e., zero or one koi. The manufacturer sells
the scanners in a turbo mode, as a k-scanner where k(≥ 1) scanners operate
simultaneously yielding an ordered sequence of k results. The professor further
asserts that each fish in the pond is equally healthy, agile and alert to avoid
a scanner, thus having equal chance of being trapped (or not) in the scanner.
We study the relevant important concepts centered around this koi pond
scenario. We give a quick summary below of the questions that motivate the
different concepts.

1 The scientific name for this beautiful fish is Cyprinus carpio.

47
48 Pattern Discovery in Bioinformatics: Theory & Algorithms

    The koi pond setting with A, C, G, T types         Foundations, terminology (Ω, F, P )
    k-scanner                                          Bernoulli trials
    What is the probability of having a
    nonhomogenous scan?                                Binomial distribution, multiple events
    What is the probability of an unbalanced scan?     Random variables, expectations
    ∞-scanner with type counts                         Poisson distribution
    ∞-scanner with type mass                           Normal distribution
    Uniform model: Is the scanner fair (random)?       Statistical significance, p-value,
                                                       Central Limit Theorem

3.2 Basic Probability


Probability is a branch of applied mathematics that studies the effects of
chance. More precisely, probability theory is the study of probability spaces
and random variables.

3.2.1 Probability space foundations


We adopt the Kolmogorov’s axiomatic approach to probability and define
a probability space as a triplet

(Ω, F, P ) (3.1)

satisfying certain axioms stated in the following paragraphs. Once a given


problem is set in this framework, interesting questions (which are usually
about probabilities of specific events) can be answered.

3.2.1.1 Defining Ω and F


We go back to the koi pond to define the different terms. In this scenario,
a random trial 2 is the process of fishing in the pond with a k-scanner. For
example when a 1-scanner is used, the outcome of the random trial is the type

2 Historically this has been called an experiment.



it yields: A, C, G, T or zero (denoted by -). The sample space of a random


trial is the set of all possible outcomes. This is a nonempty set and is usually
denoted as Ω. In this example,

Ω = {A, C, G, T, -}. (3.2)

An element of Ω is usually denoted by ω.


An event is a set of outcomes of a random trial, thus a subset of the sample
space Ω. When the event is a single outcome, it is often called an elementary
event or an atomic event. Thus events are subsets of Ω and the set of events is
usually denoted by F. An element of F is usually denoted by E. Sometimes

ω ∈ Ω,

is not distinguished from the singleton set

{ω} ∈ F

and is referred to as an elementary event.

3.2.1.2 Measurable spaces∗∗


In mathematics, a σ-algebra over a set Ω is a collection, F, of subsets of Ω
that is closed under countable set operations. In other words,
1. if E ∈ F, then its complement Ē ∈ F and
2. if E1 , E2 , E3 , . . . ∈ F then (∪i Ei ) , (∩i Ei ) ∈ F.
When Ω is discrete (finite or countable), then the σ-algebra is the whole power
set 2^Ω (the set of all subsets of Ω). However, when Ω is not discrete, care needs
to be taken for probability P to be meaningfully defined. In case Ω = R^n ,
it is convenient to take the class of Lebesgue measurable subsets of R^n (which


form a σ-algebra) for F. Since the construction of a non-Lebesgue measurable
subset usually involves the Axiom of Choice, one does not encounter such a
set in practice. So, we do not belabor this point. But it is still important to
note that not every set of outcomes is an event.

3.2.1.3 Defining probability P


Assuming that the pair Ω and F is a measurable space, we now define a
probability measure (or just probability) denoted by P . It is a function from
F to the nonnegative real numbers, written as,

P :F →R

satisfying the Kolmogorov Axioms:



1. For each E ∈ F,
P (E) ≥ 0.

2. It is certain that some atomic element of Ω will occur. Mathematically


speaking,
P (Ω) = 1.

3. For pairwise disjoint sets E1 , E2 , . . . ∈ F,

       P (E1 ∪ E2 ∪ . . .) = Σ_i P (Ei ).

We leave it as an exercise for the reader (Exercise 12) to show that under
these conditions, for each E ∈ F,

0 ≤ P (E) ≤ 1. (3.3)

Usually the probability measure is also called a probability distribution func-


tion (or simply the probability distribution).

3.2.2 Multiple events (Bayes’ theorem)


In practice, we almost always deal with multiple events, so the next nat-
ural topic is to understand the delicate interplay between these multiply (in
conjunction) occurring events.

Bayes’ rule. The Bayesian approach is one of the most commonly used
methods in a wide variety of applications ranging from bioinformatics to com-
puter vision. Roughly speaking, this framework exploits multiply occurring
events in observed data sets by using the occurrence of one or more events to
(statistically) guess the occurrence of the other events. Note that there can
be no claim on an event being either the cause or the effect.
The simplicity of the Bayesian rule is very appealing and we discuss this
below.
Joint probability is the probability of two events in conjunction. That is, it
is the probability of both events together. The joint probability of events E1
and E2 is written as
P (E1 ∩ E2 )
or just
P (E1 E2 ).
Going back to the foundations, Kolmogorov axioms lead to the natural con-
cept of conditional probability. For E1 with

P (E1 ) > 0,

the probability of E2 given E1 , denoted by P (E2 |E1 ), is defined as follows:

    P (E2 |E1 ) = P (E1 ∩ E2 ) / P (E1 ).
In other words, conditional probability is the probability of some event E2 ,
given the occurrence of some other event E1 .
In this context, the probability of an event say E1 is also called the marginal
probability. It is the probability of E1 , regardless of event E2 . The marginal
probability of E1 is written P (E1 ).
Bayes’ theorem relates the conditional and marginal probability distribu-
tions of random variables as follows:

THEOREM 3.1
(Bayes’ theorem) Given events E1 and E2 in the same probability space,
with
P (E2 ) > 0,

the following holds:

P (E2 |E1 )
P (E1 |E2 ) = P (E1 ). (3.4)
P (E2 )

The proof falls out of the definitions and the result is often interpreted as:

    Posterior = (Likelihood / normalization factor) × Prior.
We will pick up this thread of thought in a later chapter on maximum likeli-
hood approach to problems.
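
As a small illustration of the rule, the numbers below are purely hypothetical:
suppose a single scanner catches a C with probability 2/9, comes up nonempty
with probability 8/9, and a C scan is certainly nonempty. Bayes' theorem then
gives the probability that a nonempty scan is in fact a C.

    from fractions import Fraction

    P_C           = Fraction(2, 9)    # E1: the scan catches a C (hypothetical)
    P_nonempty    = Fraction(8, 9)    # E2: the scan is nonempty (hypothetical)
    P_E2_given_E1 = Fraction(1)       # a C scan is certainly nonempty

    # Bayes' theorem: P(E1 | E2) = P(E2 | E1) P(E1) / P(E2)
    print(P_E2_given_E1 * P_C / P_nonempty)    # 1/4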

3.2.3 Inclusion-exclusion principle


Mutually exclusive vs independent events. Recall that an event, in a
sense, is synonymous with a set. Two nonempty sets E1 and E2 are mutually
exclusive if and only if they have an empty intersection. Using the probability
measure P , two events E1 and E2 are mutually exclusive if and only if

1. P (E1 ) ≠ 0, P (E2 ) ≠ 0, and,

2. P (E1 ∩ E2 ) = 0.

Mutually exclusive events are also called disjoint. It follows that if E1 and E2
are mutually exclusive, then the conditional probabilities are zero. i.e.,

P (E1 |E2 ) = P (E2 |E1 ) = 0.

What can we say when E1 ∩ E2 is not empty, i.e., E1 ∩ E2 ≠ ∅?

There is a very important concept called the independence of events, that


comes into play here. It is a subtle concept and has the same connotation as
the natural meaning of the word ‘independence’. Loosely speaking, when two
events are independent, it means that knowing about the occurrence of one of
them does not yield any information about the other. In case the events are
dependent, usually great care needs to be taken to account for the interplay
arising from their dependence. Note that if E1 ∩ E2 = ∅ and P (E1 )P (E2 ) ≠ 0,
then the events are (very) dependent.
Mathematically speaking, two events E1 and E2 are independent if and only
if the following hold:

1. P (E1 ) ≠ 0, P (E2 ) ≠ 0, and,

2. P (E1 ∩ E2 ) = P (E1 )P (E2 ).

An alternative view of the same is as follows. Let E1 and E2 be two events


with
P (E1 ), P (E2 ) > 0.
If the marginal probability of E1 is the same as its conditional (with E2 )
probability, then E1 and E2 are independent. In other words, if

P (E1 ) = P (E1 |E2 ),

then E1 and E2 are said to be independent events.

Union of events. Using the Kolmogorov axioms one can deduce the fol-
lowing:
P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 ).
For a natural generalization to

P (E1 ∪ E2 ∪ . . . ∪ En ),

define the following quantities: Sl is the sum of the probabilities of the inter-
section of all possible l out of the n events. Thus

    S1 = Σ_i P (Ei ),
    S2 = Σ_{i<j} P (Ei ∩ Ej ),
    S3 = Σ_{i<j<k} P (Ei ∩ Ej ∩ Ek ),

and so on.

THEOREM 3.2
(Inclusion-exclusion principle)

    P (E1 ∪ E2 ∪ . . . ∪ En ) = S1 − S2 + S3 − S4 + S5 . . . + (−1)^(n+1) Sn
                             = Σ_{i=1}^{n} (−1)^(i+1) Si .

It can be seen that S1 ≥ S2 ≥ . . . ≥ Sn , hence,

    (−1)^k Sk + (−1)^(k+1) Sk+1  ≥ 0   for k odd,
                                 ≤ 0   for k even.                      (3.5)

This implies that (the proof is left as an exercise for the reader):

    P (E1 ∪ E2 ∪ . . . ∪ En )  ≤ Σ_{i=1}^{k} (−1)^(i+1) Si   for k odd,
                               ≥ Σ_{i=1}^{k} (−1)^(i+1) Si   for k even.    (3.6)

Inequalities (3.6) are often referred to as Bonferroni’s Inequalities. They are


useful in practice in order to obtain quick upper and lower estimates on the
probabilities of unions of events. For instance, when k = 1,

P (E1 ∪ E2 ∪ . . . ∪ En ) ≤ S1 .

The above is expanded as follows, which is also known as Boole’s Inequality.

P (E1 ∪ E2 ∪ . . . ∪ En ) ≤ P (E1 ) + P (E2 ) + . . . + P (En ). (3.7)



3.2.4 Discrete probability space


When Ω is finite or countable, it is called a discrete space. For such cases,
P induces the following function,

MP : Ω → R≥0

given by
MP (ω) = P ({ω}). (3.8)
In other words, MP assigns a probability to each element ω ∈ Ω. The function
MP is often called a probability mass function. It can be verified that if P
satisfies the Kolmogorov’s axioms (Section (3.2.1.3)) then MP must satisfy
the following (called the probability mass function conditions or probability
axioms):
1. 0 ≤ MP (ω) ≤ 1, for each ω ∈ Ω, and,

2. Σ_{ω∈Ω} MP (ω) = 1.

Note also that MP in turn determines P . Thus when Ω is finite, it is simpler
to specify MP instead of P . Also, in this case F = 2^Ω . Thus for a discrete
setting the probability space is specified by the triplet:

    (Ω, 2^Ω , MP )                                                      (3.9)

A concrete example. We go back to the professor’s koi pond. She decides


to use a 2-scanner and then asks the following question:
What is the probability of having a nonhomogenous scan, i.e., different
types in the same scan?
The specification of Ω and F is usually determined from the nature of the
random event, which in this case is the result of a 2-scanner. Let this be
denoted by uv where u is the outcome of the first and v the outcome of the
second of the 2-scanner. Since the two scanners operate simultaneously, we
define Ω as the following set of 25 elements:

    Ω = { - -, A-, C-, G-, T-,
          -A, AA, CA, GA, TA,
          -C, AC, CC, GC, TC,
          -G, AG, CG, GG, TG,
          -T, AT, CT, GT, TT }                                          (3.10)
Next, we need some more information before we can specify P or MP . We


gather the following intelligence:

1. The manufacturer asserts the following:

(a) each scanner in the k-scanner operates simultaneously, thus the


scan of each scanner is independent of the other, and,
(b) the odds of having an empty scan, in each scanner, is just one in
nine and the scanner is not partial to any particular koi type.

2. The professor asserts that each type in her pond is equally likely to be
caught in a scanner (we call this the uniform model).

Bernoulli trials. We model each scanner separately as a (multi-outcome)


Bernoulli trial. Usually a Bernoulli trial has two outcomes: ‘success’ and
‘failure’. However, we use a generalization where each trial has N possible
outcomes and N is some integer larger than one. The probability of each
outcome xi is given as pr(xi ), defined so that the following holds:

    Σ_{i=1}^{N} pr(xi ) = 1.

Using (1b) and (2), we have N = 5 and we model pr(x) as follows:

    pr(x) = 2/9   if x = A,
            2/9   if x = C,
            2/9   if x = G,
            2/9   if x = T,
            1/9   otherwise.                                            (3.11)

From (1a) above, for each uv ∈ Ω,

MP (uv) = pr(u)pr(v) (3.12)

Next, there is a need to show that the probability function P (as derived from
MP ) satisfies the probability measure conditions. This is left as an exercise
for the reader (Exercise 15).

Back to the query. Let E denote the event that the outcome of the 2-
scanner is homogenous. We claim the following:

E = {- -, A-, -A, -C, C-, -G, G-, -T, T-, AA, CC, GG, TT}

We specify nonhomogeneity to be the presence of at least two distinct types in


the scan. We treat the absence of nonhomogeneity to be homogenous and with
this view (subjective interpretation), the first nine elements in the set E are
considered to be homogenous. The last four elements are clearly homogenous.

Treating E as the union of singleton sets and since all singleton intersections
are empty, we get

    P (E) = Σ_{uv∈E} P ({uv})            (probability measure cond (2))
          = Σ_{uv∈E} MP (uv)             (probability mass function defn (3.8))
          = Σ_{uv∈E} pr(u)pr(v)          (using Eqn (3.12))
          = 33/81                        (using Eqn (3.11))

Let Ē denote the event that the scan is not homogenous. Then P (Ē) is
given by:

    P (Ē) = P (Ω \ E)                    (since E ∪ Ē = Ω)
          = P (Ω) − P (E)                (probability measure cond (2), and since E ∩ Ē = ∅)
          = 1 − P (E)                    (since P (Ω) = 1)
          = 48/81
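
The same numbers fall out of a brute-force enumeration over the 25 outcomes,
a convenient way to double-check such calculations; a small Python check
using exact fractions:

    from fractions import Fraction
    from itertools import product

    pr = {'A': Fraction(2, 9), 'C': Fraction(2, 9), 'G': Fraction(2, 9),
          'T': Fraction(2, 9), '-': Fraction(1, 9)}     # uniform model, Eqn (3.11)

    def homogeneous(u, v):
        # At most one distinct koi type in the scan.
        return len({x for x in (u, v) if x != '-'}) <= 1

    P_E = sum(pr[u] * pr[v] for u, v in product(pr, repeat=2)
              if homogeneous(u, v))
    print(P_E, 1 - P_E)    # 11/27 (= 33/81) and 16/27 (= 48/81)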

Alternative (CG-rich) model. Note that Ω (and thus F) were specified


in the treatment by studying the nature of the query, which in turn defined
the random trial. However, P was specified by taking the input from the
scanner manufacturer and the professor.
Professor Marwin realizes that her pond has more of the C and G type than
A and T. We call this the CG-rich model. She realizes that some quantitative
information is required and does some testing of her own. She concludes that
the C and G types are three times as likely to be caught in the scanner as the A and T. At this
point, we do not question how she gets these numbers (that is a different topic
of discussion), but we must build this conclusion into our model.
Note that the specification of Ω (and thus F) does not change for the CG-
rich model. But pr(x) is redefined as follows:

    pr(x) = 1/9   if x = A,
            3/9   if x = C,
            3/9   if x = G,
            1/9   if x = T,
            1/9   otherwise.                                            (3.13)

We leave the arguments that pr(x) leads to a probability measure as an ex-


ercise (Exercise 15) for the reader. The computation of the probability of
the nonhomogenous event under this new model is also left as an exercise
(Exercise 16) for the reader.

3.2.5 Algebra of random variables


Motivation. Using the CG-rich model defined in the last scenario, consider
the questions:

What is the average number of C’s in a scan?


What is the variance?
What is the average number of A’s and C’s?

To answer a question of this kind conveniently, we need to expand our vo-


cabulary. A random variable is a variable that is associated with the outcome
of a random trial or an event and takes real values.
In the algebraic axiomatization of probability theory, the primary concept
is that of a random variable.3 The measurable space Ω, the event space F
and the probability measure P arise from random variables and expectations
by means of representation theorems of analysis. However, we do not explore
this line of thought in detail here but just appeal to the intuition of the reader.
A random variable is not a variable in the usual sense that a single value
may not be assigned to it. However, a probability distribution P 4 may be
assigned to a random variable X, written as

X ∼ P.

More precisely, given a probability space (Ω, F, P ), a random variable, X


is a measurable function defined as:

X : Ω → R,

i.e., X maps Ω to real numbers. See Section (3.2.1.2) for the notion of mea-
surability of functions. Often

P ({ω ∈ Ω | X(ω) ≤ x0 })

is abbreviated as
P (X ≤ x0 ).
Further, since random variables are simply functions, they can be manipu-
lated as easily as functions. Thus, we have the following.

1. A real constant c is a random variable.

2. Let X1 and X2 be two random variables on the same probability space.

3 Recall that in the Kolmogorov’s axiomatic approach the event and its probability is the
primary concept.
4 Usually P is specified by the distribution parameters mean µ and variance σ 2 and written

as P (µ, σ).

(a) The sum of X1 and X2 is a random variable denoted

X1 + X2 .

(b) The product of X1 and X2 is a random variable denoted

X1 X2 .

Note that addition and multiplication of random variables are commu-


tative.

3.2.6 Expectations
The mathematical expectation (or simply expectation) of X, denoted by
E[X], is defined as follows:
    E[X] = ∫ X dP                                                       (3.14)

Recall that X is a measurable function.5 E[X] can be interpreted to be the


average value of X. When Ω is finite or countable, X is a discrete random
variable. Recall that the probability space is

(Ω, 2^Ω , MP ) and

    E[X] = Σ_{ω∈Ω} X(ω) MP (ω).                                         (3.15)

The expected values of the powers of X are called the moments of X. The
l-th moment of X is defined by

    E[X^l ] = ∫ X^l dP.                                                 (3.16)

Similarly, the central moments are expected values of powers of (X − E[X]).

3.2.6.1 Properties of expectations


Using the definition of expectation, we can show the following properties.
Given two random variables X1 and X2 defined on the same probability space,
the following properties hold.

5 In this chapter we have been denoting an event with E. Expectation is also denoted with

an E, and it should be clear from the context what we mean.



1. Inequality property:

If X1 ≤ X2 then E[X1 ] ≤ E[X2 ].

2. Linearity of expectations: For any a, b ∈ R,

E[aX1 + bX2 ] = aE[X1 ] + bE[X2 ].

3. Nonmultiplicative: In general,

       E[X1 X2 ] ≠ E[X1 ]E[X2 ].

However, when X1 and X2 are independent,

E[X1 X2 ] = E[X1 ]E[X2 ].

These results are straightforward to prove and follow from the properties of
integration (or summations in the discrete scenario).

3.2.6.2 Back to the concrete example


Consider our running example of the koi pond: What is the average number
of C’s in a scan? What is the variance? What is the average number of A’s,
and what is the average number of A’s and C’s?
To answer this query, two random variables

XC , XA : Ω → R

may be defined as follows (where Z is C, A):

XZ (ω) = l,

where l is the number of Z’s in ω. Then the average number of C’s in ω is
given by the expected value of XC . Using Ω defined in Equation (3.10) and
the probabilities in Equation (3.13) we obtain:

    E[XC ] = Σ_{ω∈Ω} XC (ω) MP (ω)
           = Σ_{uv∈Ω} XC (uv) MP (uv)
           = 1 · (MP (-C) + MP (AC) + MP (GC) + MP (TC)
                  + MP (C-) + MP (CA) + MP (CG) + MP (CT)) + 2 · MP (CC)
           = 1 (3/81 + 3/81 + 9/81 + 3/81 + 3/81 + 3/81 + 9/81 + 3/81) + 2 (9/81)
           = 54/81 = 2/3.

Similarly, the average number of A’s in ω is given by E[XA ] = 18/81 = 2/9.
By linearity of expectations, the average number of A’s and C’s is given by

    E[XC + XA ] = E[XC ] + E[XA ] = 72/81 = 8/9.

The variance of XC , V [XC ], is the second moment about the mean, and is
given by

    V [XC ] = E[(XC − E[XC ])²]
            = E[XC² + E[XC ]² − 2 XC E[XC ]]
            = E[XC² ] + E[XC ]² − 2 E[XC ] E[XC ]      (using linearity of E)
            = E[XC² ] − E[XC ]² .

As before, using Ω defined in (3.10) and Equation (3.13) we obtain:

    E[XC² ] = Σ_{ω∈Ω} XC² (ω) MP (ω)
            = 1 (MP (-C) + MP (C-) + MP (AC) + MP (GC)
                 + MP (TC) + MP (CA) + MP (CG) + MP (CT)) + 4 (MP (CC))
            = 1 (3/81 + 3/81 + 3/81 + 9/81 + 3/81 + 3/81 + 9/81 + 3/81) + 4 (9/81)
            = 72/81.

Thus

    V [XC ] = (72/81) − (54/81)² = 8/9 − 4/9 = 4/9.
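
Again, a brute-force enumeration over the 25 outcomes of the 2-scanner, under
the CG-rich model, confirms these values; a small Python check using exact
fractions:

    from fractions import Fraction
    from itertools import product

    pr = {'A': Fraction(1, 9), 'C': Fraction(3, 9), 'G': Fraction(3, 9),
          'T': Fraction(1, 9), '-': Fraction(1, 9)}     # CG-rich model, Eqn (3.13)

    def expect(f):
        # E[f] over all outcomes uv of the 2-scanner, with MP(uv) = pr(u) pr(v).
        return sum(f(u, v) * pr[u] * pr[v] for u, v in product(pr, repeat=2))

    count_C = lambda u, v: (u == 'C') + (v == 'C')
    count_A = lambda u, v: (u == 'A') + (v == 'A')

    E_C  = expect(count_C)
    E_A  = expect(count_A)
    E_C2 = expect(lambda u, v: count_C(u, v) ** 2)
    print(E_C, E_A, E_C + E_A, E_C2 - E_C ** 2)    # 2/3, 2/9, 8/9, 4/9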

3.2.7 Discrete probability distribution (binomial, Poisson)


Let X be a random variable, then the probability density function of X is
the function f satisfying

P (x ≤ X ≤ x + dx) = f (x)dx.

This distribution can also be uniquely described by its cumulative distribution


function F (x), which is defined for any x ∈ R as

F (x) = P (X ≤ x).

    Random variable          Probability mass function f       mean µ      variance σ²
    X ∼ P (µ, σ) (domain)    (discrete)                        E[X] = X̄    V [X] = E[(X−X̄)²]

    (a) Binomial             C(n, k) p^k (1 − p)^(n−k)         np          np(1 − p)
        k = 0, 1, . . . , n

    (b) Poisson              e^(−λ) λ^k / k!                   λ           λ
        k ∈ N

                             Probability density function f
                             (continuous)

    (c) Normal               (1/(σ√(2π))) exp(−(x−µ)²/(2σ²))   µ           σ²
        x ∈ R

    (c’) Standard Normal     (1/√(2π)) exp(−x²/2)              0           1
        x ∈ R

FIGURE 3.0: Probability distributions and their parameters.

Note that the density function f and the cumulative distribution function F
are related by
Z x
F (x) = f (t)dt.
−∞

The binomial and Poisson distribution are among the most well-known
discrete probability distributions. The normal or the Gaussian is the most
commonly used continuous distribution.
Binomial distribution is the discrete probability distribution that expresses
the probability of the number of 1’s in a sequence of n independent 0/1 ex-
periments, each of which yields 1 with probability p. Thus the mass function

    prbinomial (k; n, p) = C(n, k) p^k (1 − p)^(n−k) ,

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient, gives the probability
of seeing exactly k 1’s for a fixed n and p. It can be verified that

    Σ_{k=0}^{n} prbinomial (k; n, p) = 1.

The mean and variance are as follows:

    X̄binomial = E[X] = Σ_{k=0}^{n} k prbinomial (k; n, p) = np,
    V [X] = E[(X − X̄binomial )²] = np(1 − p).
Poisson distribution is a discrete probability distribution that expresses the
probability of a number of events occurring in a fixed period of time if these
events occur with a known average rate λ, and are independent of the time
since the last event. The mass function is given by

    prpoisson (k; λ) = e^(−λ) λ^k / k!

and denotes the probability of seeing exactly k events for a fixed λ. It can be
verified that

    Σ_{k=0}^{∞} prpoisson (k; λ) = 1.

Note the convention that 0! = 1. The mean and variance are as follows:

    X̄poisson = E[X] = Σ_{k=0}^{∞} k prpoisson (k; λ) = λ,
    V [X] = E[(X − X̄poisson )²] = λ.
The following lemma gives a convenient way of modeling additive influences
of phenomena. It says that the sum of two binomial (respectively Poisson) random
variables is also a binomial (respectively Poisson) random variable. The proofs of the claims
require some intricate but straightforward algebraic manipulations as shown
below. We say that
X∼f
if X is a discrete (respectively continuous) random variable with probability
mass (respectively density) function f .

LEMMA 3.1
Let X1 and X2 be two independent random variables.

1. If X1 ∼ Binomial(n1 , p) and X2 ∼ Binomial(n2 , p), then

X = X1 + X2 ∼ Binomial(n1 +n2 , p)

2. If X1 ∼ P oisson(λ1 ) and X2 ∼ P oisson(λ2 ), then

X = X1 + X2 ∼ P oisson(λ1 +λ2 )

PROOF Let X = X1 + X2 . Since the domains of both X1 and X2 are the
nonnegative integers, we have

    pr(X = k) = Σ_{i=0}^{k} pr(X1 = i) pr(X2 = k − i)
              = Σ_{i=0}^{k} f1 (i) f2 (k − i).

fi is the probability mass function for Xi , i = 1, 2. We now deal with the two
distributions separately.
Binomial distribution:

    pr(k) = Σ_{i=0}^{k} C(n1 , i) p^i (1 − p)^(n1 −i) C(n2 , k − i) p^(k−i) (1 − p)^(n2 −k+i)

Separating the factors independent of i from the summation,

    pr(k) = p^k (1 − p)^(n1 +n2 −k) Σ_{i=0}^{k} C(n1 , i) C(n2 , k − i)

The following combinatorial identity, also called Vandermonde’s identity,

    Σ_{i=0}^{k} C(n1 , i) C(n2 , k − i) = C(n1 + n2 , k),

can be verified by equating the coefficients of x^k on both sides of

    (1 + x)^(n1) (1 + x)^(n2) = (1 + x)^(n1 +n2) .

Thus,

    pr(k) = C(n1 + n2 , k) p^k (1 − p)^(n1 +n2 −k) .

Hence X ∼ Binomial(n1 + n2 , p).

Poisson distribution:

    pr(k) = Σ_{i=0}^{k} (e^(−λ1 ) λ1^i / i!) (e^(−λ2 ) λ2^(k−i) / (k − i)!)
          = e^(−(λ1 +λ2 )) Σ_{i=0}^{k} λ1^i λ2^(k−i) / (i! (k − i)!)

Multiplying by k!/k!, we get

    pr(k) = (e^(−(λ1 +λ2 )) / k!) Σ_{i=0}^{k} (k! / (i! (k − i)!)) λ1^i λ2^(k−i)

It can be verified that

    Σ_{i=0}^{k} (k! / (i! (k − i)!)) λ1^i λ2^(k−i) = Σ_{i=0}^{k} C(k, i) λ1^i λ2^(k−i)
                                                  = (λ1 + λ2 )^k .

Thus

    pr(k) = e^(−(λ1 +λ2 )) (λ1 + λ2 )^k / k!

Hence X ∼ P oisson(λ1 + λ2 ).

3.2.8 Continuous probability distribution (normal)


The normal or the Gaussian distribution arises in statistics in the following
way. The sampling distribution of the mean is approximately normal, even if
the distribution of the population, the sample is taken from, is not necessarily
normal (we show this in the Central Limit Theorem is a later section). In fact,
a variety of natural phenomena like heights of individuals, or exam scores of
students in a large undergraduate calculus class, or photon counts can be
approximated with a normal distribution. While the underlying mechanisms
may be unknown or little understood, if it can be justified that the errors or
effects of independent (but small) causes are additive, then the likelihood of
the outcomes may be approximated with a normal distribution.
The probability density function of a normal distribution with mean µ and
variance σ² is

    prnormal (X; µ, σ²) = (1 / (σ√(2π))) exp(−(X − µ)² / (2σ²)).

It can be verified that

    ∫_{−∞}^{∞} prnormal (X; µ, σ²) dX = 1.

The mean and variance are

    µ = ∫_{−∞}^{∞} X prnormal (X; µ, σ²) dX,
    σ² = ∫_{−∞}^{∞} (X − µ)² prnormal (X; µ, σ²) dX.

A random variable X is standardized using the theoretical mean and stan-
dard deviation:

    Z = (X − µ) / σ,

where µ = E[X] is the mean and σ² = V [X] is the variance of the probability
distribution of X.
The lemma below follows from using the characteristic functions discussed
in Section (3.3.3) which we leave as an exercise for the reader (Exercise 20).

LEMMA 3.2
If X1 ∼ N ormal(µ1 , σ1²) and X2 ∼ N ormal(µ2 , σ2²) are two independent
random variables, then

    X = X1 + X2 ∼ N ormal(µ1 + µ2 , σ1² + σ2²).

Binomial, Poisson & normal distributions. Figure 3.0 summarizes the
different probability distributions. Before we conclude this section, we relate
the three probability distributions that we studied here. Binomial is in a
sense the simplest discrete probability distribution. It gets its name from its
resemblance to the coefficients of a binomial (x + y)^n : the function values are
indeed the coefficients of this polynomial, and putting x = p and y = 1 − p
in fact shows that the values add up to 1.

As n approaches ∞ and p approaches 0 while np remains fixed at λ > 0 or


np approaches λ > 0, the
Binomial(n, p)
distribution approaches the Poisson distribution with expected value λ,

P oisson(λ).

As n approaches ∞ while p remains fixed, the distribution of

    (X − np) / √(np(1 − p))

approaches the normal distribution with expected value 0 and variance 1,
N ormal(0, 1).

3.2.9 Continuous probability space (Ω is R)


We go back to the running example using the professor’s koi pond. She is
lured by the manufacturer to try their turbo ∞-scanner, which they claim is
an extremely long scanner. Then she asks the following question:

What is the probability of an unbalanced scan, i.e., having at least three
times as many C’s and G’s as A’s and T’s in a scan?

The professor asserts that the pond is large enough for these huge scanners.
Now, we must work closely with the in-house scientists of the manufacturer
to model this problem appropriately. After a series of carefully controlled
experiments, they make the following observations for the ∞-scanners.

1. The chance of getting an empty outcome in an individual scanner within
the ∞-scanner has gone up significantly with large k's, giving rise to
sparsely filled ones at each scan.

2. The average number, λ, of a type scanned is independent of the actual


length of the scanner. In the professor’s koi pond the average numbers
for the four types C, G, A and T are observed to be

λC = λG = .054,
λT = λA = .018,

respectively.

3. For a fixed i, the chance of having a nonzero outcome in the ith of the
∞-scanner in multiple scans (trials) is zero.

Due to independence of the occurrence of each event, the joint probability


of the quadruple is given as:

P (XA , XC , XG , XT ) = P (XA )P (XC )P (XG )P (XT ).

To construct the probability space, let the outcome of a trial be denoted by


a quadruple
(iA , iC , iG , iT )
where iz ,
z = {A, C, G, T },
is the number of type z in the scan. Then we define the probability space

(Ω, 2Ω , MP )

as follows:
Ω = {(iA , iC , iG , iT ) | 0 ≤ iA , iC , iG , iT }.
The occurrence of each type is independent of the other (see observation 3),
and we model each as a Poisson distribution. Thus
$$
M_P(i_A, i_C, i_G, i_T) \;=\; \prod_{z\,\in\,\{A,\,C,\,G,\,T\}} Poisson(i_z;\,\lambda_z), \tag{3.17}
$$
with
$$
\lambda_C = \lambda_G = .054, \qquad \lambda_T = \lambda_A = .018.
$$

Next, does this probability distribution satisfy the Kolmogorov’s axioms?


This is left as an exercise for the reader (Exercise 17).
However, the model can be simplified using the two following observations:
1. λA = λT , λC = λG , and
2. the professor lumps the C’s and G’s together, and, A’s and T’s together
in her query.
Under this ‘condensed’ model, let the outcome of a trial be denoted by a pair

(iAT , iCG ).

iAT is the number of type A or T and iCG is the number of type C or G in


the scan. We define the probability space

(Ω, 2Ω , MP )

as follows:
Ω = {(iAT , iCG ) | 0 ≤ iAT , iCG }.
FIGURE 3.1: The Ω space in the Poisson and normal distributions; panel (a) is the discrete case and panel (b) the continuous case. Each panel plots the AT count against the CG count, with the event E ⊆ Ω shown shaded; the shaded region corresponds to $x_{CG} \ge 3\,y_{AT}$. [Figure not reproduced here.]

Also we use the fact the sum of two Poisson random variables (parame-
ters λ1 , λ2 ) is another Poisson random variable (parameter λ1 + λ2 ), by
Lemma (3.1). Thus
MP : Ω → R
is defined as follows:

MP (iAT , iCG ) = P oisson(iAT ; 2λA )P oisson(iCG ; 2λC ). (3.18)

However, the professor's question is quite a curveball. But we have set
up the probability space quite conveniently to handle queries of such complex
flavors. Let E denote the event that there are at least three times as many
C+G's as A+T's in a scan. Our interest is in the Ω space shown in Figure 3.1(a).
The shaded region corresponds to

iCG ≥ 3iAT

as shown. Then
$$
P(E) \;=\; \sum_{i_{CG}=0}^{\infty}\;\sum_{i_{AT}=0}^{\lfloor i_{CG}/3\rfloor} Poisson(i_{AT};\,2\lambda_A)\;Poisson(i_{CG};\,2\lambda_C)
\;=\; \sum_{i_{CG}=0}^{\infty}\;\sum_{i_{AT}=0}^{\lfloor i_{CG}/3\rfloor} Poisson(i_{AT};\,2\times.018)\;Poisson(i_{CG};\,2\times.054)
\;=\; 0.968.
$$
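A sketch of how such a double sum can be evaluated in practice is given below (our own code, assuming SciPy; the truncation point K is an illustrative choice). Both rates are tiny, so the terms die off after a handful of counts; the exact value obtained depends on how the boundary of the region is handled.

# Evaluating the unbalanced-scan probability under the condensed Poisson model.
from scipy.stats import poisson

lam_AT, lam_CG = 2 * 0.018, 2 * 0.054
K = 60                                     # truncate both sums at K

p_event = sum(
    poisson.pmf(i_at, lam_AT) * poisson.pmf(i_cg, lam_CG)
    for i_at in range(K + 1)
    for i_cg in range(3 * i_at, K + 1)     # region: i_CG >= 3 * i_AT
)
print(round(p_event, 4))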

Continuous probability distribution. Let’s go back to the koi pond and


change the scenario as follows. The scanner manufacturer noticed that each
type of koi has a characteristic size (i.e., mass or weight) and since the profes-
sor’s interest is in separating the types, they provide her a scanner that works
along the lines of a mass spectrometer.
In physical chemistry, a mass spectrometer is a device that measures the
mass-to-charge ratio of ions in a given sample. This is done by ionizing the

sample and passing them through an electric and/or magnetic field, thus
separating the ions of differing masses. The relative abundance is deduced by
measuring intensities of the ion flux.
The scanner provides the relative masses of the four types in a catch. The
average mass of a type scanned has been provided by the manufacturer as
$$
\mu_A = \mu_T = 15 \qquad\text{and}\qquad \mu_C = \mu_G = 36,
$$
and a standard deviation of $\sigma = 5\sqrt{2}$ for each type.
Going back to our running example and using the 'condensed' model, the
quantity of A+T's in a scan is denoted by the random variable X which, using
Lemma (3.2), follows a normal distribution,
$$
X \sim Normal(2\mu_A,\; 2\sigma^2).
$$
Similarly, the quantity of C+G's is denoted by the random variable Y, and
$$
Y \sim Normal(2\mu_C,\; 2\sigma^2).
$$
So we define the probability space as:
$$
\Omega = [0, \infty) \times [0, \infty).
$$
Let E denote the event that there are at least three times as many C+G's as
A+T's in a scan. Our interest is in the shaded portion of the Ω space shown in
Figure 3.1(b). The shaded region corresponds to
$$
i_{CG} \ge 3\,i_{AT}
$$
as shown. Then
$$
P(E) \;=\; \iint_{E} \mathrm{pr}_{normal}(y;\,2\mu_C, 2\sigma^2)\;\mathrm{pr}_{normal}(x;\,2\mu_A, 2\sigma^2)\;dy\,dx
\;=\; 0.7154.
$$

3.3 The Bare Truth about Inferential Statistics


Simply put, inferential statistics is all about the study of deviation from
chance events. In other words, is a given observation a result of

1. a chance phenomenon (called the null hypothesis or, in the genetics parlance, the neutral model), or,
2. is there an underlying nonrandom phenomenon?
The latter is usually more interesting and may actually be an evidence of some
biological basis and a possible candidate for a new discovery (and yet another
research paper in the sea of such findings).
Interestingly, the deviation of the observations from a random model can
be measured (called the p-value). In the context of hypothesis testing this
is possible due to an incredibly simple, yet powerful result called the Central
Limit Theorem. In this section, we will develop the ideas that lead to this
theorem.

3.3.1 Probability distribution invariants


What is common across all probability distributions? Recall that we define
probability distribution as any function that satisfies Kolmogorov’s axioms.
We begin by a very interesting theorem called the Chebyshev’s Inequality
Theorem which says that in any probability distribution, almost all values
of the distribution lie close to the mean. The distance from the mean is in
terms of standard deviation of the distribution and the number of values are
a fraction of all the possible values. For example, no more than 1/4 of the
values are more than 2 standard deviations away from the mean, no more
than 1/9 are more than 3 standard deviations away, no more than 1/16 are
more than 4 standard deviations away, and so on.
The theorem is formally stated below. The proof is surprisingly straightfor-
ward and is based on Markov’s Inequality which is discussed first. This proof
simply uses the properties of integrals.

THEOREM 3.3
(Markov's inequality) If X is a random variable and a > 0, then
$$
P(|X| \ge a) \;\le\; \frac{E[|X|]}{a}.
$$

PROOF Let (Ω, F, P) be the probability space and let $X : \Omega \to \mathbb{R}$ be a
random variable. Then Ω can be partitioned into disjoint sets $\Omega_1$ and $\Omega_2$, where
$$
\Omega_1 = \{\omega \in \Omega \mid |X(\omega)| < a\} \quad\text{and}\quad
\Omega_2 = \{\omega \in \Omega \mid |X(\omega)| \ge a\}.
$$
By the definition of expectation,
$$
E[|X|] = \int_{\Omega}|X(\omega)|\,dP = \int_{\Omega_1}|X(\omega)|\,dP + \int_{\Omega_2}|X(\omega)|\,dP.
$$
Thus
$$
E[|X|] \;\ge\; \int_{\Omega_2}|X(\omega)|\,dP,
$$
since $\int_{\Omega_1}|X(\omega)|\,dP \ge 0$. But
$$
\int_{\Omega_2}|X(\omega)|\,dP \;\ge\; a\int_{\Omega_2}dP,
$$
since $|X(\omega)| \ge a$ for all $\omega \in \Omega_2$, and
$$
\int_{\Omega_2}dP = P(|X| \ge a).
$$
Hence
$$
E[|X|] \;\ge\; a\,P(|X| \ge a).
$$

Now we are ready to prove Chebyshev’s inequality which is a direct appli-


cation of Markov’s inequality.

THEOREM 3.4
(Chebyshev's inequality theorem) If X is a random variable with mean
µ and (finite) variance σ², then for any real k > 0,
$$
P(|X-\mu| \ge k\sigma) \;\le\; \frac{1}{k^2}.
$$

PROOF Define a random variable Y as (X − µ)². Setting a = (kσ)² and
applying Markov's inequality to Y, we obtain
$$
P(|X-\mu| \ge k\sigma) \;=\; P\!\left(Y \ge (k\sigma)^2\right)
\;\le\; \frac{E\!\left[|X-\mu|^2\right]}{k^2\sigma^2}
\;=\; \frac{\sigma^2}{k^2\sigma^2} \;=\; \frac{1}{k^2}.
$$
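Chebyshev's bound is distribution-free, which is easy to see empirically. The sketch below is our own illustration (NumPy assumed; the choice of an exponential distribution and the sample size are arbitrary): it estimates P(|X − µ| ≥ kσ) and compares it against 1/k².

# Empirical check of Chebyshev's inequality for a deliberately non-normal law.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # any distribution works here
mu, sigma = x.mean(), x.std()

for k in (2, 3, 4):
    frac = np.mean(np.abs(x - mu) >= k * sigma)  # observed tail fraction
    print(k, round(frac, 4), "<=", round(1 / k**2, 4))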

3.3.2 Samples & summary statistics

Consider a very large population. Assume we repeatedly take samples of
size n from the population and compute some summary statistics. We briefly
digress to discuss our options here.

Summary statistics. Statistics is the branch of applied mathematics that
deals with the analysis of a large collection of data. The two major branches of
this discipline are descriptive statistics and inferential statistics. Descriptive
statistics is the branch that concerns itself mainly with summarizing a set of
data.
In this section, our focus is on summary statistics, which are numerical values
that summarize the data. One option is to study the central tendency, usually
via the mean, median or the mode; the other is to study the variability,
usually via the range or variance.
Let's go back to the professor's pond. Let $x_i$ be the number of T types in
a scan labeled i. Let
$$
x_1, x_2, \ldots, x_n
$$
be the collection of such real-valued numbers, i.e., the counts of type T in n
scans. Let us call this collection S.
The mean µ of S is:
$$
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i.
$$

The mode of S is the element $x_i$ that occurs the most number of times in the
collection. Sorting the elements of S as
$$
x_{i_1} \le x_{i_2} \le \ldots \le x_{i_n},
$$
the median is the element $x_{i_l}$ in this sorted list, where $l = \lceil n/2 \rceil$ for n odd,
and for n even the median is taken between the elements at positions $n/2$ and $n/2 + 1$.
The range of S is $(x_{max} - x_{min})$, where
$$
x_{min} = \min_i x_i, \qquad x_{max} = \max_i x_i.
$$

Next, σ is the standard deviation, and the variance σ² of S is given as:
$$
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2.
$$
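A minimal sketch of these summary statistics is given below (our own code; the sample values are made up, and the even-n median is taken as the average of the two middle elements).

# Summary statistics of a collection S of scan counts.
from collections import Counter

def summary(S):
    n = len(S)
    mean = sum(S) / n
    mode = Counter(S).most_common(1)[0][0]           # most frequent element
    xs = sorted(S)
    median = xs[n // 2] if n % 2 == 1 else (xs[n // 2 - 1] + xs[n // 2]) / 2
    rng = max(S) - min(S)
    var = sum((x - mean) ** 2 for x in S) / n        # population form, as in the text
    return mean, mode, median, rng, var

print(summary([3, 1, 4, 1, 5, 9, 2, 6]))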

Back to samples. Recall that we repeatedly take samples of size n from


a large population. We compute the mean µ and variance σ 2 , as discussed
in the last section, for each sample of size n. Clearly, the different samples
will lead to different sample means. The distribution of these means is the
sampling distribution of the sample mean for n.
What can we say about this distribution? What can we say about the
population by studying only the samples of size n?
It is not immediately apparent from what we have seen so far. It turns out
that we can say the following:
1. Let the mean of the sample be µ_{n-sample} and let its standard deviation be
σ_{n-sample} (this is sometimes also called the standard error). If µ_{population}
is the mean and σ_{population} the standard deviation of the (theoretical) population
that the sample is derived from, then
• µ_{population} is usually estimated to be µ_{n-sample}, and,
• σ_{population} is estimated to be √n · σ_{n-sample}.
Thus if we need to halve the standard error, the sample size must be
quadrupled.
2. The sampling distribution of the sample mean is in fact a normal distribution
$$
Normal(\mu_{n\text{-}sample},\; \sigma^2_{n\text{-}sample})
$$
(assuming the original population is 'reasonably behaved' and the sample
size n is sufficiently large). A weaker version of this is stated in the
following theorem.

THEOREM 3.5
(Standardization theorem) If $X_i \sim Normal(\mu, \sigma^2)$, $1 \le i \le n$, and
$$
\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i,
$$
then
$$
Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim Normal(0, 1).
$$

This follows from the Central Limit Theorem.


3. Consider the sample median from the sample data. It has a different
sampling distribution which is usually not normal.
This is just to indicate that not all summary statistics have a normal
behavior.
Now, we look at the theorems that lie behind the observations above. The
first result on the sample mean and variance is easily obtained by simply using
the definition of expectations.

THEOREM 3.6
(Sample mean & variance theorem) If $X_1, X_2, X_3, \ldots$ is an infinite
sequence of random variables which are pairwise independent and each of
whose distribution has the same mean µ and variance σ², and
$$
\bar X_n = \frac{X_1 + \ldots + X_n}{n},
$$
then
$$
E[\bar X_n] = \mu \qquad\text{and}\qquad V[\bar X_n] = \frac{\sigma^2}{n},
$$
i.e., the mean of the distribution of the random variable $\bar X_n$ is µ and its
standard deviation is $\sigma/n^{1/2}$.

PROOF By linearity of expectation we have
$$
E[\bar X_n] = E\!\left[\frac{X_1 + X_2 + \ldots + X_n}{n}\right]
= \frac{E[X_1] + E[X_2] + \ldots + E[X_n]}{n} = \mu.
$$
Next,
$$
E[\bar X_n^2] \;=\; E\!\left[\left(\frac{\sum_i X_i}{n}\right)^{\!2}\right]
\;=\; \frac{1}{n^2}\,E\!\left[\sum_i X_i^2 + 2\sum_{i<j} X_i X_j\right]
\;=\; \frac{1}{n^2}\left(\sum_i E[X_i^2] + 2\sum_{i<j} E[X_i X_j]\right).
$$
The last step above is due to linearity of expectations. The random variables
are independent, thus for each $i \ne j$,
$$
E[X_i X_j] = E[X_i]\,E[X_j] = \mu^2.
$$
Further, for each i,
$$
E[X_i^2] = \sigma^2 + \mu^2.
$$
Hence we have
$$
E[\bar X_n^2] \;=\; \frac{1}{n^2}\left(n(\sigma^2 + \mu^2) + 2\,\frac{n^2 - n}{2}\,\mu^2\right)
\;=\; \frac{\sigma^2}{n} + \mu^2.
$$
Recall that
$$
V[\bar X_n] = E[\bar X_n^2] - E[\bar X_n]^2.
$$
Thus,
$$
V[\bar X_n] = \left(\frac{\sigma^2}{n} + \mu^2\right) - \mu^2 = \frac{\sigma^2}{n}.
$$

In the last theorem, we saw that the expected value of the sample mean is µ. The Law of Large
Numbers gives a stronger result regarding the distribution of the random variable
$\bar X_n$. It says that as n increases the distribution concentrates around µ. The
formal statement and proof are given below.

THEOREM 3.7
(Law of large numbers) If $X_1, X_2, X_3, \ldots$ is an infinite sequence of
random variables which are pairwise independent and each of whose distribution
has the same mean µ and variance σ², then for every ε > 0,
$$
\lim_{n \to \infty} P(|\bar X_n - \mu| < \epsilon) = 1,
$$
where
$$
\bar X_n = \frac{X_1 + X_2 + \ldots + X_n}{n}
$$
is the sample average.

PROOF By Theorem (3.6),
$$
E[\bar X_n] = \mu \qquad\text{and}\qquad V[\bar X_n] = \frac{\sigma^2}{n}.
$$
For any k > 0, Chebyshev's inequality applied to the random variable $\bar X_n$ gives
$$
P\!\left(|\bar X_n - \mu| \ge k\,\frac{\sigma}{\sqrt{n}}\right) \le \frac{1}{k^2}.
$$
Thus, letting
$$
\epsilon = k\,\frac{\sigma}{\sqrt{n}},
$$
we get
$$
P\!\left(|\bar X_n - \mu| \ge \epsilon\right) \le \frac{\sigma^2}{\epsilon^2 n}.
$$
In other words, since P is a probability distribution,
$$
P\!\left(|\bar X_n - \mu| < \epsilon\right) \ge 1 - \frac{\sigma^2}{\epsilon^2 n}.
$$
Note that since σ is finite and ε is fixed,
$$
\lim_{n \to \infty} \frac{\sigma^2}{\epsilon^2 n} = 0.
$$
Thus
$$
\lim_{n \to \infty} P(|\bar X_n - \mu| < \epsilon) = 1.
$$
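The concentration of the sample mean is easy to watch in a simulation; the sketch below is our own illustration (NumPy assumed; the uniform distribution, ε and the replicate count are arbitrary choices).

# Law of large numbers: the fraction of sample means within eps of mu grows to 1.
import numpy as np

rng = np.random.default_rng(1)
mu, eps, reps = 0.5, 0.05, 1_000            # Uniform(0, 1) has mean 0.5
for n in (10, 100, 1_000, 10_000):
    means = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(means - mu) < eps))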

3.3.3 The central limit theorem


A much stronger version of the Large Number Theorem is the Central Limit
Theorem which also gives the distribution of the sample mean as the sample
size grows. Assuming the random variables have finite variances, regardless
of the underlying distribution of the random variables, the standardized dif-
ference between the sum of the random variables and the mean of this sum,
converges in distribution to a standard normal random variable.
In this section we give only an indication of the proof of the Central Limit
Theorem, using methods from Fourier analysis. For a more rigorous proof
the reader must consult a standard textbook on probability theory [Fel68].
We first define a characteristic function and state four important properties
that we use in the proof of the theorem.

Characteristic function & its properties. Let $F_X$ be the probability
density function of a random variable X. The characteristic function $\Phi_X$ of
X is defined by:
$$
\Phi_X(t) = \int_{-\infty}^{\infty} e^{itx} F_X(x)\,dx.
$$
Note that here we mean $i = \sqrt{-1}$.
For those of you familiar with Fourier analysis, this is just the Fourier transform
of the probability density function of X. The four crucial properties of
the characteristic function we will be using are as follows.

1. For any constant a ≠ 0 and random variable X,
$$
\Phi_{aX}(t) = \Phi_X(at).
$$
This is easy to verify from the definition.

2. The characteristic functions of random variables also have the nice property
that the characteristic function of the sum of two independent
random variables is the product of the characteristic functions of the
individual random variables. More precisely, if $X = X_1 + X_2$ then
$$
\Phi_X = \Phi_{X_1}\,\Phi_{X_2}.
$$
We leave the proof of this as an exercise for the reader.

3. Using differentiation under the integral sign, we have
$$
\Phi_X(t) = \int_{-\infty}^{\infty} e^{itx} F_X(x)\,dx,\qquad
\Phi'_X(t) = \int_{-\infty}^{\infty} ix\,e^{itx} F_X(x)\,dx,\qquad
\Phi''_X(t) = \int_{-\infty}^{\infty} (ix)^2 e^{itx} F_X(x)\,dx.
$$
Using the above and Taylor's formula, we have that for all small enough
t > 0,
$$
\Phi_X(t) = \Phi_X(0) + \frac{\Phi'_X(0)}{1!}\,t + \frac{\Phi''_X(0)}{2!}\,t^2 + \ldots
= 1 + i\,E[X]\,t + i^2\,\frac{E[X^2]}{2!}\,t^2 + o(t^2).
$$

4. If $X \sim Normal(0, 1)$, implying that
$$
F_X(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2},
$$
then one can show using complex integration that
$$
\Phi_X(t) = e^{-t^2/2}.
$$

THEOREM 3.8
(Central limit theorem) If $X_1, X_2, X_3, \ldots$ is an infinite sequence of
random variables which are pairwise independent and each of whose distribution
has the same mean µ and variance σ², then
$$
\lim_{n \to \infty} S_n \sim Normal(n\mu,\; \sigma^2 n),
$$
where the random variable $S_n$ is defined by
$$
S_n = X_1 + \ldots + X_n.
$$
Normalizing $S_n$ as $Z_n$, we have the following. If
$$
Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}},
$$
then
$$
\lim_{n \to \infty} Z_n \sim Normal(0, 1).
$$

PROOF Let
$$
Y_i = \frac{X_i - \mu}{\sigma}.
$$
Then $Y_i$ is a random variable with mean 0 and standard deviation 1, and
$$
Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i.
$$
Applying properties (1) and (2),
$$
\Phi_{Z_n}(t) = \prod_{i=1}^{n} \Phi_{Y_i}\!\left(\frac{t}{\sqrt{n}}\right). \tag{3.19}
$$
Using Property 3, for all i > 0,
$$
\Phi_{Y_i}(t) = 1 - \frac{t^2}{2} + o(t^2). \tag{3.20}
$$
It follows from Equations (3.19) and (3.20) that
$$
\Phi_{Z_n}(t) = \left(1 - \frac{t^2}{2n} + \frac{\gamma(t, n)}{n}\right)^{\!n},
$$
where
$$
\lim_{n \to \infty} \gamma(t, n) = 0.
$$
Hence
$$
\lim_{n \to \infty} \Phi_{Z_n}(t) = \lim_{n \to \infty}\left(1 - \frac{t^2}{2n} + \frac{\gamma(t, n)}{n}\right)^{\!n} = e^{-t^2/2}.
$$
Thus the characteristic function of $Z_n$ approaches $e^{-t^2/2}$ as n approaches ∞.
By Property 4, we know that $e^{-t^2/2}$ is the characteristic function of a normally
distributed random variable with mean 0 and standard deviation 1. This implies
(with some more work, which we omit) that the probability distribution of $Z_n$
converges to $Normal(0, 1)$ as n approaches ∞.

This is a somewhat astonishing and strong result, particularly when you
realize that each $X_i$ can have any distribution, as long as the mean is µ and
the variance is σ². Under these conditions, the sample sum $S_n$ approaches
the normal distribution $Normal(n\mu, \sigma^2 n)$ as n approaches ∞. Very often, it is
simpler to use the standard normal distribution $Normal(0, 1)$, and $Z_n$ converges
to this distribution as n approaches ∞.
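The theorem is also easy to watch numerically. The sketch below is our own illustration (NumPy and SciPy assumed; the exponential distribution and the sample sizes are arbitrary choices): it standardizes sums of skewed exponential variables and compares a couple of their tail probabilities with those of Normal(0, 1).

# Central limit theorem: standardized sums of Exponential(1) draws ~ Normal(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu = sigma = 1.0                             # Exponential(1): mean = sd = 1
n, reps = 500, 20_000

s = rng.exponential(1.0, size=(reps, n)).sum(axis=1)
z = (s - n * mu) / (sigma * np.sqrt(n))      # the Z_n of Theorem 3.8

for t in (1.0, 2.0):                         # compare tail probabilities
    print(t, round(np.mean(z > t), 3), round(1.0 - norm.cdf(t), 3))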

3.3.4 Statistical significance (p-value)


An observed event is significant if it is unlikely to have occurred by chance.
The significance of the event is also called its p-value, a real number in the
interval [0, 1]. The smaller the p-value, the more significant the occurrence of
the event.

Theoretical framework & practical connotations. Note that in our


framework of the probability space
(Ω, F, P ),
the p-value of an event E ∈ F is simply the probability of the event
P (E).
The probability distribution P models the ‘chance’ phenomenon (note that
this model does not allow supporting an alternative phenomenon or explana-
tion). It is important to understand this for a correct interpretation of the
p-value.
However, in practice often a binary question needs to be answered:
Is the observed event statistically significant?
To deal with this question, certain thresholds are widely used and more or
less universally accepted.
When p-value (probability) of an event E is
P (E) = p,
then the level of significance is defined to be p × 100%. Fixing a threshold at
α (generally 0 ≤ α ≪ 0.5), an event E is significant at
α × 100%
level if
P (E) ≤ α.
Or, the event is simply statistically significant or interesting.
The following thresholds (also known as significance level), are generally
used in practice:
1. (α = 0.10): denoted by α and the level of significance is 10%.
2. (α = 0.05): denoted by α∗ and the level of significance is 5%.
3. (α = 0.01): denoted by α∗∗ and the level of significance is 1%.
Clearly the result is most impressive when threshold α∗∗ is used and the least
impressive when α is used.
An alternative view of significance level (usually in the context of traditional
hypothesis testing) is as follows. If the null hypothesis is true, the significance
level is the (largest) probability that the observation will be rejected in error.

Concrete example. We go back to the professor’s koi pond under the


uniform model where we treat the C and G types as one kind and the A and
T types as the other. We wish to test this hypothesis:
The chance of scanning each kind is equal and the 1-scanner is not
favorable towards any particular kind. This is our null hypothesis.
We study this in four steps.
1. Experiment: We will carry out 20 scans using the 1-scanner and count
the number of C+G (to be interpreted as either C or G) kind in each
scan.
2. Probability Space: The discrete probability space

(Ω, 2Ω , MP )

is defined as follows:
Ω = {0, 1, 2, . . . , 20}.
By our null hypothesis, MP is defined by the distribution

Binomial(20, 1/2).

Thus,
$$
M_P(\omega \in \Omega) = \binom{20}{\omega}\, p^{\omega}\,(1-p)^{20-\omega}.
$$
3. Outcome of the experiment: Let X denote the number of C+G types
in a scan. Suppose in our experiment of 20 scans we see 16 C+G types.
Then
$$
P(X \ge 16) \;=\; \sum_{k=16}^{20} M_P(k)
\;=\; \sum_{k=16}^{20} \binom{20}{k} \left(\frac{1}{2}\right)^{\!20}
\;=\; 0.0573.
$$

In other words,
p-value = 0.0573

4. Significance evaluation: We use α∗ and α below to evaluate the statis-


tical significance.
• Using a threshold of α∗ , since p-value > α∗ , we say: ‘Observing 16
C+G kind is not statistically significant at the 5% level’. In other
words, at the 5% significance level, the null hypothesis is true.

• Using a threshold of α, since p-value < α, we say: ‘Observing 16


C+G kind is statistically significant at the 10% level.’ In other
words, at the 10% significance level, the null hypothesis is false.
Our conclusion is as follows: At the 5% significance level, the observation
is not ‘surprising’. But at the 10% significance level, the observation is
‘surprising’.
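As a sketch of how such tail p-values can be computed in practice, the helper below is our own (it assumes SciPy; the counts used in the call are placeholder values, not the scan experiment above), and it simply compares the upper-tail probability under Binomial(n, 1/2) against the chosen thresholds.

# Generic upper-tail p-value under the null hypothesis X ~ Binomial(n, 1/2).
from scipy.stats import binom

def upper_tail_pvalue(x_obs, n, p=0.5):
    # sf(k) = P(X > k), hence P(X >= x_obs) = sf(x_obs - 1)
    return binom.sf(x_obs - 1, n, p)

# Placeholder numbers: 8 'successes' in 10 trials.
pval = upper_tail_pvalue(8, 10)
print(pval, pval <= 0.05, pval <= 0.10)      # ~0.055: significant at 10%, not at 5%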

3.4 Summary
The chapter was a whirlwind tour of basic statistics and probability. It is
worth realizing that this field has taken at least a century to mature, and what
it has to offer is undoubtedly useful as well as beautiful. A correct modeling
of real biological data will require understanding not only of the biology but
also of the probabilistic and statistical methods.
The chapter has been quite substantial, in keeping with the amount of work
in the bioinformatics literature that uses statistical and probabilistic ideas.
Information theory is an equally interesting field that deserves some discussion
here. Since we have already introduced the reader to random variables and
expectations, a few basic ideas in information theory are explored through
Exercise 21.

3.5 Exercises
Exercise 12 Consider the Kolmogorov’s axioms of Section (3.2.1.3). Show
that under these conditions, for each E ∈ F,
0 ≤ P (E) ≤ 1.

Exercise 13 (alternative axioms) We call the following the SB Axioms


on
P : F → R≥0
where Ω is finite:
1. P (∅) = 0.
2. P (Ω) = 1.
3. For each E1 , E2 ∈ F,
P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 ).

Prove the following statements.


1. Kolomogorov Axiom 1 follows from the SB Axioms.
2. P (∅) = 0 follows from Kolomogorov Axioms.
3. The Kolmogorov and SB Axioms are equivalent.

Exercise 14 (Exclusive-inclusive principle)


1. Show that
S1 ≥ S2 ≥ . . . ≥ Sn ,
where the Si ’s are defined as in Theorem (3.2).
2. Show that
$$
P(E_1 \cup E_2 \cup \ldots \cup E_n)
\begin{cases}
\le \sum_{i=1}^{k} (-1)^{i+1} S_i & \text{for } k \text{ odd},\\[4pt]
\ge \sum_{i=1}^{k} (-1)^{i+1} S_i & \text{for } k \text{ even}.
\end{cases}
$$

Hint: Use Equation (3.5).

Exercise 15
1. Let pr(x) be defined as follows (Eqn (3.11)),
$$
\mathrm{pr}(x) = \begin{cases}
2/9 & \text{if } x = \texttt{A},\\
2/9 & \text{if } x = \texttt{C},\\
2/9 & \text{if } x = \texttt{G},\\
2/9 & \text{if } x = \texttt{T},\\
1/9 & \text{otherwise}.
\end{cases}
$$
For each $uv \in \Omega$, let
$$
M_P(uv) = \mathrm{pr}(u)\,\mathrm{pr}(v).
$$
Show that $M_P$ is a probability mass function.

2. Let pr(x) be defined as follows (Eqn (3.13)),
$$
\mathrm{pr}(x) = \begin{cases}
1/9 & \text{if } x = \texttt{A},\\
3/9 & \text{if } x = \texttt{C},\\
3/9 & \text{if } x = \texttt{G},\\
1/9 & \text{if } x = \texttt{T},\\
1/9 & \text{otherwise}.
\end{cases}
$$
For each $uv \in \Omega$, let
$$
M_P(uv) = \mathrm{pr}(u)\,\mathrm{pr}(v).
$$
Show that $M_P$ is a probability mass function.

Hint: What is $\sum_{uv \in \Omega} \mathrm{pr}(u)\,\mathrm{pr}(v)$?

Exercise 16 Consider a k-scanner, CG-rich model of the koi pond scenario


used in this chapter.
1. What is the probability of having a nonhomogenous scan?
2. What is the average number of A’s in a scan?

Exercise 17 Show that $M_P$ as defined in Equation (3.17) as
$$
M_P(i_A, i_C, i_G, i_T) = \prod_{z\,\in\,\{A,\,C,\,G,\,T\}} Poisson(i_z;\,\lambda_z)
$$
satisfies Kolmogorov's axioms.

Exercise 18 (Inequality property of expectations) Based on the defi-


nition of E[X], prove the following: If
X1 ≤ X2
then
E[X1 ] ≤ E[X2 ].

Exercise 19 Let a function $f_\lambda(k)$, defined on the integers Z, i.e., on
$\ldots, -1, 0, 1, \ldots$, for a given λ > 0, be as follows:
$$
f_\lambda(k) = \begin{cases}
e^{-\lambda} & \text{if } k = 0,\\[4pt]
\dfrac{e^{-\lambda}\,\lambda^{|k|}}{2\,|k|!} & \text{otherwise}.
\end{cases}
$$
1. Show that $f_\lambda$ is a probability mass function.
2. If the probability mass function of X is $f_\lambda$, we say a random variable
$X \sim pseudoPoisson(\lambda)$. If two independent random variables $X_1$ and $X_2$
are such that
$$
X_1 \sim pseudoPoisson(\lambda) \quad\text{and}\quad X_2 \sim pseudoPoisson(\lambda),
$$
and $X = X_1 + X_2$, then determine $\mathrm{pr}(X = k)$.

Hint:
1. Show that
$$
\sum_{k=-\infty}^{\infty} f_\lambda(k) = 1.
$$
2. The convolution of functions $f_1$ and $f_2$ is written $f_1 * f_2$ and is given by
$$
(f_1 * f_2)(x) = \int_{-\infty}^{\infty} f_1(t)\, f_2(x - t)\, dt.
$$

We assume that the functions f1 and f2 are defined on the whole real
line, extending them by 0 outside their domains of definition otherwise.
For the intrepid reader: can you prove the following?

(a)
f1 ∗ f2 = f2 ∗ f1 .

(b) The probability distribution of the sum of two independent random


variables is the convolution of each of their distributions.

Back to our problem: What is the convolution of fλ with itself?

Exercise 20 (Normal random variables) If
$$
X_1 \sim Normal(\mu_1, \sigma_1^2) \quad\text{and}\quad X_2 \sim Normal(\mu_2, \sigma_2^2)
$$
are two independent random variables, then show that
$$
X = X_1 + X_2 \sim Normal(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2).
$$

Hint: Use convolution (see Exercise 19) of the two density functions:
$$
f_X(x) = \int_{-\infty}^{\infty} f_{X_1}(x - y)\, f_{X_2}(y)\, dy.
$$

Exercise 21 (Information theory) In information theory, the entropy of


a random variable X is a measure of the amount of ‘uncertainty’ of X. In
the following, we consider discrete random variables.

1. Let pr(k) denote the probability pr(X = k). Then the entropy of X is defined to be
$$
H(X) = \sum_k \mathrm{pr}(k)\,\log\frac{1}{\mathrm{pr}(k)} = -\sum_k \mathrm{pr}(k)\,\log \mathrm{pr}(k).
$$
(a) Using the definition of entropy, show that
$$
H(X) = E[-\log(\mathrm{pr}(X))].
$$
(b) Show that the entropy H(X) is maximized under the uniform distribution,
i.e., when each outcome is equally likely.
2. Let $X_1$ and $X_2$ be two discrete random variables and let $\mathrm{pr}(k_i)$ denote
$\mathrm{pr}(X_i = k_i)$, i = 1, 2. Let $\mathrm{pr}(k_1, k_2)$ denote the joint probability of
$X_1 = k_1$ and $X_2 = k_2$.
(a) The joint entropy is defined to be
$$
H(X_1, X_2) = \sum_{k_1, k_2} \mathrm{pr}(k_1, k_2)\,\log\frac{1}{\mathrm{pr}(k_1, k_2)}
= -\sum_{k_1, k_2} \mathrm{pr}(k_1, k_2)\,\log \mathrm{pr}(k_1, k_2).
$$
(b) The conditional entropy of $X_1$ given $X_2$ is defined to be
$$
H(X_1 \mid X_2) = \sum_{k_1, k_2} \mathrm{pr}(k_1, k_2)\,\log\frac{\mathrm{pr}(k_2)}{\mathrm{pr}(k_1, k_2)}.
$$
Show that
$$
H(X_1 \mid X_2) = H(X_1, X_2) - H(X_2).
$$

3. Mutual information is a measure of information that can be obtained
about one random variable by observing another. This is given as:
$$
I(X_1; X_2) = \sum_{k_1, k_2} \mathrm{pr}(k_1, k_2)\,\log\frac{\mathrm{pr}(k_1, k_2)}{\mathrm{pr}(k_1)\,\mathrm{pr}(k_2)}.
$$
Show that
(1) $I(X_1; X_2) = H(X_1) - H(X_1 \mid X_2)$,
(2) $I(X_1; X_2) = I(X_2; X_1)$.

Hint: To show (1)(b) above use induction on k (base case k = 1).
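The entropy identities above are easy to verify on a small joint distribution; the sketch below is our own (NumPy assumed; the joint probability table is an arbitrary toy example).

# Checking H(X1|X2) = H(X1, X2) - H(X2) and I(X1; X2) = H(X1) - H(X1|X2).
import numpy as np

pr = np.array([[0.10, 0.30],                 # pr[k1, k2]: joint probabilities
               [0.25, 0.35]])

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p1, p2 = pr.sum(axis=1), pr.sum(axis=0)      # marginals of X1 and X2
H12, H1, H2 = H(pr), H(p1), H(p2)

H_1_given_2 = float(np.sum(pr * np.log2(p2[None, :] / pr)))   # from the definition
I = float(np.sum(pr * np.log2(pr / np.outer(p1, p2))))        # from the definition

print(round(H_1_given_2, 6), round(H12 - H2, 6))   # identity (b)
print(round(I, 6), round(H1 - H_1_given_2, 6))     # identity (1)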


A few interesting nuggets about entropy and random variables:
1. Note that when X is a continuous random variable with density function
f(x), H(X) is defined as
$$
H(X) = \int_{-\infty}^{\infty} f(x)\,\log\frac{1}{f(x)}\, dx.
$$
2. The normal distribution $Normal(\mu, \sigma^2)$ has the maximum entropy over
all distributions on the reals with mean µ and variance σ². Thus if we know
only the mean and variance of a distribution, it is very reasonable to
assume it to be a normal distribution.
3. The uniform distribution on the interval [a, b] is defined by the probability
density function
$$
f(x) = \begin{cases}
\dfrac{1}{b - a} & \text{when } a \le x \le b,\\[4pt]
0 & \text{otherwise}.
\end{cases}
$$
The uniform distribution is a maximum entropy distribution among all
continuous distributions defined on [a, b] (and 0 elsewhere).

Exercise 22 (Expected minimum, maximum values)
1. Show that the expected value of the minimum of n independent random
variables, uniformly distributed over (0, 1), is
$$
\frac{1}{n + 1}.
$$
2. Show that the expected value of the maximum of n independent random
variables, uniformly distributed over (0, 1), is
$$
\frac{n}{n + 1}.
$$
3. Consider the MaxMin Problem described in Exercise 10.
If each edge weight $wt(v_1, v_2)$ is distributed uniformly over (0, 1), then
what is the expected path length?
(Recall that the path length is the maximum over all paths, and in each
path the length is the minimum of all the edge weights.)

Hint: 1. & 2. Let n = 2. The value of the min and max functions, respectively,
is constant along the dashed lines shown in the original figure (level sets of min
and max over the unit square). We need to integrate these two functions over the
unit area. Note that for the min function, in the lower triangle min(x, y) = y and
in the upper triangle min(x, y) = x. The reverse is true for the max function.
[Figure: level sets of min (left) and max (right) over the unit square.]
3. Use 1. and 2. and the expected path depth (see Exercise 11).

Comments
My experience with students (and even some practicing scientists) has been
that there is a general haziness when it comes to interpreting or understand-
ing statistical computations. I prefer Kolmogorov’s axiomatic approach to
probability since it is very clean and elegant. In this framework, it is easy
to formulate the questions, identify the assumptions and most importantly
convince yourself and others about the import and the correctness of your
models.
A small section of this chapter dwells on the proofs of some fundamental
theorems in probability and statistics. This is just to show that most of the
results we need to use in bioinformatics are fairly simple, and that the basis of
these foundational results can be understood by most who have a nodding
acquaintance with elementary calculus.
In my mind the most intriguing theorem in this area is the Central Limit
Theorem (and also the Large Number Theorem). While the latter, quite sur-
prisingly, has an utterly simple proof, the former requires some familiarity
with function convergence to appreciate the underlying ideas. In a sense it is
a relief to realize that the proof of the most used theorem in bioinformatics is
not too difficult.
Chapter 4
What Are Patterns?

Patterns lie in the eyes of the beholder.


- anonymous bioinformaticist

4.1 Introduction
Patterns haunt the mind of a seeker. And, when the seeker is equipped
with a computer with boundless capabilities, limitless memory, and easily
accessible data repositories to dig through, there is no underestimating the
care that should be undertaken in defining such a task and interpreting the
results. For instance, the human genome alone is 3,000,000,000 nucleotides
long! A sequence of 3 billion characters defined on an alphabet size of four
(A, C, G, T) offers endless possibilities for useful as well as possibly useless
nuggets of information.1
The genomes are so large in terms of nucleotide lengths that a database 2
reports the sizes of the genomes by its weight in picograms (1 picogram = one
trillionth of a gram). It is also important to note that genome size does not
correlate with organismal complexity. For instance, many plant species and
some single-celled protists have genomes much larger than that of humans.
Also, genome sizes (called C-values) vary enormously among species, and the
relationship of this variation to the number of genes in the species is unclear.
The presence of noncoding regions settles the question to a certain extent
in eukaryotes. The human genome, for example, comprises only about 1.5%
protein-coding genes, with the remaining being various types of noncoding
DNA, especially transposable elements.3
Hence our tools of analysis or discovery must be very carefully designed.
Here we discuss models that are mathematically and statistically ‘sensible’
and hopefully, this can be transposed to biologically ‘meaningful’.

1 Also, Ramsey theory states that even random sequences, when sufficiently large, will contain
some sort of pattern (or order).
2 Gregory, T.R. (2007). Animal Genome Size Database. http://www.genomesize.com
3 As reported by the International Human Genome Sequencing Consortium 2001.


4.2 Common Thread


What is a pattern? In our context, a pattern is a nonrandom entity. We
need a characterization that is amenable to objective scrutiny or an automatic
discovery. We define a pattern to be a nonunique phenomenon.4 A pattern
could be rare or rampant. But a unique event is not a pattern. The terms
pattern and motif are used interchangeably in the book. The choice of one
over the other in different contexts is mainly to be consistent with published
literature.
We start with the premise that we do not know the questions, but believe
the answers to be embedded in the observations or a given body of data.
What must the question be?
The hope is that if a phenomenon appears multiple times it is likely to tell
a tale of interest. In nature, if a pattern is seen in a structure, or in a DNA
coding segment, then perhaps it is a sign that nature is reusing the ‘objects’
or ‘modules’ and could be of potential interest.
Is it possible to elicit invariant features independent of the precise definition
of the ‘phenomenon’ ?

4.3 Pattern Duality


Let the input data be I. A pattern is a nonunique phenomenon observed in
I. Let this pattern, say p, occur k distinct times in the data. The k occurrences
are represented as an occurrence list or a location list, Lp , written as

Lp = {i | p occurs at location i in I}. (4.1)

If I were m-dimensional, then a location in the input could be specified by an


m-tuple
i = (i1 , i2 , . . . , im ).

We call the description of p as a combinatorial feature and its location list


Lp as a statistical feature. This is so called since the former offers a description
of the pattern and the latter denotes its frequency in the data.

Statistical definition (using Lp ). We next address the following curiosity:


Given I, does L, a collection of positions on I define a pattern p?

4 See Exercises 32, 33 and 34 for possible alternative models.



There are at least two issues with this:


1. Firstly, there may be no pattern p that occurs at the positions in L.
2. Secondly, if there is indeed a p that occurs at each position i ∈ L, it is
still possible that p may occur at
j 6∈ L.

Thus this definition of a pattern is ‘well-defined’ or ‘nontrivial’ since any


arbitrary collection of positions does not qualify as a pattern specification.

Back to duality. It is interesting to note that given an input I, a pattern


is completely specified by
1. either its combinatorial description, p, written as (I, p),
2. or its statistical description, Lp , written as (I, Lp ).
In other words, (I, p) is determined by (I, Lp ) and conversely.

Fuzziness of an occurrence i ∈ L. Consider the following example where


the pattern is defined as an uninterrupted sequence such as
p = C G T T T.
The occurrence of this pattern in an input is marked by the bracketed segment below:
I = A A [C G T T T] C G
i = 1 2  3 4 5 6 7  8 9
What is the occurrence position, i, of p? Should i be 3 or 4 or 5 or 6 or 7?
It might be easier to follow some convention, such as the left most position
(i = 3). But this has an unreasonable bias to the left. Perhaps a more
unbiased choice of i is 5. Thus it is merely a convention and it is important
to bear that in mind.
To keep the generality, without being bogged down with technicalities, we
assume that two positions
i ∈ Lp1 and j ∈ Lp2
are equal, written as
i =δ j,
when the following holds
|i − j| ≤ δ(i, j),
for some appropriately chosen δ(i, j) ≥ 0. To avoid clutter, we omit the
subscript δ in =δ . Thus, for i ∈ Lp1 and j ∈ Lp2 ,
(i = j) ⇔ (|i − j| ≤ δ(i, j)) .
This is also discussed in the two concrete examples of Section 4.4.4.

Back to pattern specification. To summarize, a pattern is uniquely de-


fined by

1. specifying the form of p (such as a string or as a permutation or a partial


order or a boolean expression and so on) and

2. its occurrence (i.e., what condition must be satisfied to say that p occurs
at position i in the input).

Thus a pattern can be specified in many and varied ways, but its location
list, mercifully, is always a set. This gives a handle to extract the common
properties of patterns, independent of its specific form.

4.3.1 Operators on p
What operators should (or can) be defined on patterns? Can we identify
them even before assigning a specific definition to a pattern?
Here we resort to the duality of the patterns. So, what are the operations
defined on location lists? Since sets have union and intersection operations
defined on them, then it is meaningful to define two operators as follows.
The plus operator (⊕):

p = p1 ⊕ p2 ⇔ Lp = Lp1 ∩ Lp2 .

The times operator (⊗):

p = p1 ⊗ p2 ⇔ Lp = Lp1 ∪ Lp2 .

Further, the following properties hold which again are a direct consequence
of the corresponding set operation.

p1 ⊕ p2 = p2 ⊕ p1
p1 ⊗ p2 = p2 ⊗ p1
(p1 ⊕ p2 ) ⊕ p3 = p1 ⊕ (p2 ⊕ p3 )
(p1 ⊗ p2 ) ⊗ p3 = p1 ⊗ (p2 ⊗ p3 )
(p1 ⊗ p2 ) ⊕ p3 = (p1 ⊕ p3 ) ⊗ (p2 ⊕ p3 )
(p1 ⊕ p2 ) ⊗ p3 = (p1 ⊗ p3 ) ⊕ (p2 ⊗ p3 )
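The duality makes these operators almost trivial to realize in code, since patterns are manipulated purely through their location lists (sets of positions). The sketch below is our own illustration; the pattern names and positions are placeholders.

# The plus and times operators acting on location lists.
L_p1 = {(1, 1), (2, 1), (3, 1)}
L_p2 = {(1, 1), (3, 1), (4, 2)}

def plus(L1, L2):       # p1 (+) p2  <=>  L_p = L_p1 intersect L_p2
    return L1 & L2

def times(L1, L2):      # p1 (x) p2  <=>  L_p = L_p1 union L_p2
    return L1 | L2

print(plus(L_p1, L_p2))     # {(1, 1), (3, 1)}
print(times(L_p1, L_p2))    # the four distinct locations
# Commutativity, associativity and distributivity all follow from set algebra.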

4.4 Irredundant Patterns


Let P (I) be the set of all patterns in an input I.

Patterns by their very nature have a large component of repetitiveness. A


very desirable requirement is to identify (and eliminate) these duplications.5
So a very natural question arises:

Is there some redundancy in P (I)?


In other words, is it possible to trim P (I) of its belly-fat so to say, without
losing its essence?
This is the motivation for the next definition. For p ∈ P (I) and a set of
patterns
{p1 , p2 , . . . , pl } ⊂ P (I),
with each pi 6= p, let

Lp = Lp1 ∪ Lp2 ∪ . . . ∪ Lpl .

Then p is redundant with respect to (w.r.t.) these l patterns which we write


as
{p1 , p2 , . . . , pl } ֒→ p.
Also the pattern p is defined as

p1 ⊗ p2 ⊗ . . . ⊗ pl = p.

If p is such that for each l ≥ 1 there exists no p1 , p2 , . . . , pl ∈ P (I), with

{p1 , p2 , . . . , pl } ֒→ p,

then p is called irredundant.


Note that redundancy and irredundancy of patterns is always with respect
to I. There is no notion of irredundancy or redundancy for an isolated form
p.
Also, since redundancy is defined in terms of sets (location lists), this holds
across the board for patterns of all specifications.

4.4.1 Special case: maximality


The reader may already have an intuitive sense for maximality, at least for
string patterns. For a concrete example, see the three boxed occurrences of

p = A T T G C,

on an input string I:

I=GA ATTGC GG ATTGC CAC ATTGC CT

5 The reader may be familiar with the use of maximality, which is widely used to remove
redundancies.

and
Lp = {3, 10, 18}.

Intuitively, what are the nonmaximal patterns? Assume, for convenience, that
a string pattern must have at least two elements. Then, for p = ATTGC, all the
nonmaximal patterns $p_1, p_2, \ldots, p_9$ are shown below:
p1 = ATTG, p2 = ATT, p3 = AT,
p4 = TTGC, p5 = TTG, p6 = TT,
p7 = TGC, p8 = TG,
p9 = GC.

Note that if the input were I ′ , then the occurrences of p8 is shown below:

I′ = G A A T T G C G G A T T G C C A C A T T G C C T G

Here p8 , a substring of p, is considered maximal 6 since it has an independent


fourth occurrence.
However, here we define maximality as a special case of irredundancy. If l
is restricted to 1, this is a special case and p is called nonmaximal w.r.t. p1 .
If p is such that there exists no p1 ∈ P (I) with

p1 ֒→ p,

then p is called maximal.


In our concrete example, we observe that (using the 'fuzzy' definition of
position i), for $1 \le j \le 9$,
$$
L_p = L_{p_j}.
$$
This says that at every occurrence of p, there is also an occurrence of $p_j$ for
each $1 \le j \le 9$. Thus one of the patterns must be more informative than all
the others, and this pattern is the maximal pattern p.
Thus, if for two distinct patterns $p_1 \ne p_2$,
$$
L_{p_1} = L_{p_2},
$$
then one of them must be nonmaximal. This agrees with the intuitive notion
that if multiple motifs occur in the same positions, then the largest (or most
informative) one is the maximal motif.
Note that maximality is always in terms of an input I. There is no notion
of maximality on an isolated string p.

6 It is possible to have an alternative view and this is discussed as Exercise 84.



4.4.2 Transitivity of redundancy


Let P (I) be the set of all patterns on an input I. Let p ∈ P (I) be redundant
w.r.t. p1 , p2 , . . . , pl ∈ P (I), i.e.,

{p1 , p2 , . . . , pl } ֒→ p.

Further, let each of $p_i$, $1 \le i \le l$, be redundant as follows:
$$
\begin{aligned}
P_1 &= \{p_{11}, p_{12}, \ldots, p_{1l_1}\} \hookrightarrow p_1,\\
P_2 &= \{p_{21}, p_{22}, \ldots, p_{2l_2}\} \hookrightarrow p_2,\\
&\;\;\vdots\\
P_l &= \{p_{l1}, p_{l2}, \ldots, p_{ll_l}\} \hookrightarrow p_l,
\end{aligned}
$$
with every $p_{ij} \in P(I)$. Then p is redundant as follows:
$$
\{p_{11}, \ldots, p_{1l_1},\; p_{21}, \ldots, p_{2l_2},\; \ldots,\; p_{l1}, \ldots, p_{ll_l}\} \hookrightarrow p.
$$
In other words, if p is redundant with respect to $p_1, p_2, \ldots, p_l$, and each $p_i$
in turn is redundant w.r.t. the set $P_i$, then p is redundant with respect to the
union of the sets, i.e.,
$$
\bigcup_i P_i \hookrightarrow p.
$$

4.4.3 Uniqueness property


A direct consequence of transitivity of redundancy is the following impor-
tant theorem. See Exercise 23 for an illustration.

THEOREM 4.1
Let P(I) be the set of all patterns that occur in a given input I. Then the
collection of irredundant patterns,
$$
P_{irredundant}(I) \subset P(I),
$$
is unique.

COROLLARY 4.1
Let P (I) be the set of all patterns that occur in a given input I. Then the
collection of maximal patterns,

Pmaximal (I) ⊂ P (I),

is unique.

The construction of the irredundant patterns results in a unique collection,
i.e., it is independent of the order of the construction. This set is also called
a basis, since any pattern $p \in P(I)$ can be written as
$$
p_1 \otimes p_2 \otimes \ldots \otimes p_l = p,
$$
for some
$$
p_1, p_2, \ldots, p_l \in P_{basis}(I).
$$
Thus the basis is defined as follows:
$$
P_{basis}(I) = P_{irredundant}(I) = \{p \in P(I) \mid p \text{ is irredundant}\}.
$$
Note that
$$
P_{basis}(I) \subseteq P_{maximal}(I) \subseteq P(I).
$$

4.4.4 Case studies


Here we sketch two concrete examples. The focus is on studying the relation
between p and Lp .

Concrete example 1 (string pattern). Let the input be defined on Σ.


An element in Σ is called a solid character. The dot character, ‘.’, is a wild
card and stands for any character in that position. A string pattern is a
sequence on
Σ ∪ {‘.’}.
A pattern of length one or one that does not start or end with a solid character
is called a trivial pattern. If for 1 ≤ i ≤ l, the following holds

p ֒→ pi ,

then we write
p ֒→ p1 , p2 , . . . , pl

For ease of exposition, we consider the input to be the four sequences shown
below, and P(I) is the set of all nontrivial patterns.
s1 = A B C D E F,
s2 = A G C D E G,
s3 = A B G D E H,
s4 = A B C G E A.

P(I) = { AB, BC, CD, DE,
         A.C, B.D, C.E,
         ABC, CDE,
         AB.D, A.CD, A..D,
         BC.E, B.DE, B..E,
         AB..E, A..DE, A.C.E,
         ABC.E, A.CDE, AB.DE }.

Patterns & location lists. In this example a position is a two-tuple (i, j)
which denotes position j in the ith sequence. Following the convention that
the position represents the leftmost position of the pattern in the input, we
have the following:
$$
\begin{aligned}
L_{ABC.E} &= \{(1, 1), (4, 1)\}, & L_{BC.E} &= \{(1, 2), (4, 2)\},\\
L_{ABC}   &= \{(1, 1), (4, 1)\}, & L_{BC}   &= \{(1, 2), (4, 2)\}.
\end{aligned}
$$
But each location in $L_{BC.E}$ is one position away from a position in $L_{ABC.E}$:
$$
L_{BC.E} = \{(i, j + \delta) \mid (i, j) \in L_{ABC.E}\},
$$
where δ = 1. The same holds for $L_{BC}$. Intuitively, the lists should be the
same since they are capturing the same common segments in the input with
a phase shift. Thus, the location lists are 'fuzzily' equal, i.e.,
$$
L_{ABC.E} = L_{BC.E} = L_{ABC} = L_{BC}.
$$
Similar arguments hold for the following equalities:
$$
\begin{aligned}
L_{A.CDE} &= L_{CD} = L_{CDE} = L_{A.CD},\\
L_{AB.DE} &= L_{B.D} = L_{AB.D} = L_{B.DE},\\
L_{AB..E} &= L_{AB} = L_{B..E},\\
L_{A..DE} &= L_{DE} = L_{A..D},\\
L_{A.C.E} &= L_{A.C} = L_{C.E}.
\end{aligned}
$$
Also,
$$
\begin{aligned}
L_{A..DE} &= L_{AB.DE} \cup L_{A.CDE},\\
L_{A.C.E} &= L_{A.CDE} \cup L_{ABC.E},\\
L_{AB..E} &= L_{AB.DE} \cup L_{ABC.E},\\
L_{A...E} &= L_{A.CDE} \cup L_{AB.DE} \cup L_{ABC.E}.
\end{aligned}
$$
What are the redundancies? They are shown below (redundancy with the
restriction l = 1 gives the maximal elements):
AB..E ֒→ AB, B..E
A..DE ֒→ DE, A..D
A.C.E ֒→ A.C, C.E
ABC.E ֒→ BC, ABC, BC.E
A.CDE ֒→ CD, CDE, A.CD
AB.DE ֒→ B.D, AB.D, B.DE
and, among the maximal patterns themselves,
{AB.DE, A.CDE} ֒→ A..DE
{A.CDE, ABC.E} ֒→ A.C.E
{AB.DE, ABC.E} ֒→ AB..E
{A.CDE, AB.DE, ABC.E} ֒→ A...E.

Thus
P_maximal(I) = {AB..E, A..DE, A.C.E, ABC.E, A.CDE, AB.DE},
P_basis(I) = {A.CDE, AB.DE, ABC.E}.
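Since redundancy is decided purely set-theoretically, it is easy to check mechanically. The sketch below is our own code, with the location lists transcribed from the four sequences above: a pattern is redundant exactly when its location list is the union of the location lists of other patterns.

# Redundancy check on location lists from Concrete example 1.
L = {
    "ABC.E": {(1, 1), (4, 1)},
    "A.CDE": {(1, 1), (2, 1)},
    "AB.DE": {(1, 1), (3, 1)},
    "A..DE": {(1, 1), (2, 1), (3, 1)},
    "A...E": {(1, 1), (2, 1), (3, 1), (4, 1)},
}

# A..DE is redundant w.r.t. {AB.DE, A.CDE}; A...E w.r.t. all three basis patterns.
print(L["A..DE"] == L["AB.DE"] | L["A.CDE"])                    # True
print(L["A...E"] == L["A.CDE"] | L["AB.DE"] | L["ABC.E"])       # True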

Concrete example 2 (permutation pattern). A permutation pattern is
a set that appears together, in any order, in the input. A pattern of length one
is called a trivial pattern.
For ease of exposition, we consider the input to be two sequences, and P(I)
is the set of all nontrivial patterns:
s1 = d e a b c,
s2 = c a b e d,
P(I) = { {a, b}, {a, b, c}, {e, d}, {a, b, c, d, e} }.

Patterns & location lists. In this example a position is a two-tuple (i, j)
which denotes position j in the ith sequence. Following the convention that
the position represents the leftmost position of the pattern in the input, we
have the following:
$$
\begin{aligned}
L_{\{a,b\}} &= \{(1, 3), (2, 2)\},\\
L_{\{a,b,c\}} &= \{(1, 3), (2, 1)\},\\
L_{\{e,d\}} &= \{(1, 1), (2, 4)\},\\
L_{\{a,b,c,d,e\}} &= \{(1, 1), (2, 1)\}.
\end{aligned}
$$
Each pattern occurs within the same two segments of the input. Thus, by
using the sizes of the patterns, i.e., $|\{a, b, c, d, e\}| = 5$ and $|\{a, b\}| = 2$,
it is possible to guess if one pattern is contained in the other (in this example
the occurrences are without gaps). Hence
$$
L_{\{a,b\}} = \{(i, j + \delta(i, j)) \mid (i, j) \in L_{\{a,b,c,d,e\}}\},
$$
where $\delta(1, 1) = 2$ and $\delta(2, 1) = 1$. The value of the δ's is clear from the
context. Using similar arguments, we get the following:
$$
L_{\{a,b\}} = L_{\{a,b,c\}} = L_{\{e,d\}} = L_{\{a,b,c,d,e\}}.
$$
The redundancies are shown below:
{a, b, c} ֒→ {a, b},
{a, b, c, d, e} ֒→ {a, b, c},
{a, b, c, d, e} ֒→ {d, e}.
Thus
P_maximal(I) = P_basis(I) = {{a, b, c, d, e}}.

4.5 Constrained Patterns


The reality of the situations may sometimes demand a constrained version
of the patterns. A combinatorial constraint is one that is applied on the form
of the pattern. For instance, a pattern of interest may have to follow some
density constraint or size constraint (see Chapter 6). A pattern specified as a
boolean expression may have to have a specific form such as a monotone form
(see Chapter 14).
A statistical constraint is one that is applied on the location list of the
pattern. It is possible to impose a quorum constraint, k and for a pattern p
to meet the quorum constraint, the following must hold

|Lp | ≥ k.

Thus the pattern must occur at least k times in the input to meet this quorum
constraint.

4.6 When is a Pattern Specification Nontrivial?


Often in real life, confronted with a daunting task of understanding or
mining on a very large set, there is a need to specify a pattern, whose definition
can then be used to extract patterns objectively from the data. Then it is an
extremely valid question to ask:

Is the specification nontrivial?

Here, we suggest one such condition.


The specification should be such that there must exist some collections of
positions in the input, L, that do not correspond to a pattern. See Section 4.3
for a detailed discussion. For an input I, let L denote all possible such sets,
i.e.,

L = {L | L is a collection of positions i of I} .

Consider the collection of sets that correspond to patterns by the specification:

L′ = {L ∈ L | L = Lp for some pattern p} .

The following property is desirable (ǫ is small constant):

|L′ |
< ǫ.
|L|

The specification of a pattern should not be so lax that any arbitrary collec-
tion of segments in the input qualify to be a pattern. In other words, the
probability of a list of positions to correspond to a pattern p should be fairly
low i.e.,

pr(L ∈ L′ ) < ǫ.

4.7 Classes of Patterns

We conclude the chapter by giving the reader a taste of different possible


classes of patterns in the following table. Note that the pattern form and
the pattern occurrence must be specified for an unambiguous definition. For
brevity, we omit the details and appeal to the reader’s intuition in the following
examples.
Each entry below lists the pattern p, the input I and the notion of occurrence.

String patterns (input: strings):
(1a) p = CTT; input s1 = ACTTCG, s2 = CCGTC, s3 = CTTCCG; occurrence: appears exactly.
(1b) p = C.TC; input s1 = ACTTCG, s2 = CCGTC, s3 = CTTCCG; occurrence: appears with a wild card.
(1c) p = CTT-2,3-G; input s1 = ACTTCG, s2 = CCGTC, s3 = CTTCCG; occurrence: appears with a gap of 2 or 3 wild cards.
(1d) p = GTC; input s1 = A C|G T C C G, s2 = C C G T C, s3 = G|T T C C C G; occurrence: appears exactly (but the input has multiple elements in places).

Permutation patterns (input: strings):
(2a) p = {g1, g2, g3, g4}; input s1 = g2 g4 g1 g3 g6, s2 = g7 g9 g1 g2 g3 g4, s3 = g8 g3 g1 g4 g2; occurrence: appears together, in any order.
(2b) p = {g1, g2, g3}; input s1 = g2 g5 g1 g6 g3, s2 = g4 g9 g1 g2 g3 g7, s3 = g6 g3 g1 g8 g2; occurrence: appears together, in any order, with at most 1 gap.
(2c) p = {g1, g2, g4}; input s1 = g2 g5 g1 g4 g6, s2 = g4 g9 g1 g2 g3 g4, s3 = g6 g3 g1 g4 g2; occurrence: appears together, in any order, in a fixed window of size 4.
(2d) p = {m1(2), m2}; input s1 = m2 m1 m2 m5 m6, s2 = m4 m9 m1 m2 m2 m4; occurrence: appears together, in any order, but m2 appears 2 times.

Partial order motifs (input: strings):
(3a) p = a partial order on {1, 2, 3, 4} (drawn between a source S and a sink E in the original figure) in which 1 precedes 2; input q1 = 7 1 2 3 4, q2 = 3 1 4 2 8, q3 = 9 4 1 3 2; occurrence: in a cluster, 1 precedes 2 (without gap).
(3b) the same partial order; input q1 = 7 1 2 6 3 4, q2 = 3 7 1 4 2 8, q3 = 9 4 1 3 8 2; occurrence: in a cluster, 1 precedes 2 (with gap).
(3c) sequence pattern 2 → 3; input q1 = 1 7 2 6 3 4, q2 = 2 3 1 4 2 7, q3 = 9 4 1 2 8 3; occurrence: 2 precedes 3 at each occurrence.

(4) Topological motif; input: a graph; occurrence: isomorphic to a subgraph.

Bicluster motifs (input: a 2D array):
(5a) p = (cols 1, 3; rows 2, 3, 5) of the array
.20 .50 .20 .50
.80 .10 .60 .70
.75 .60 .55 .80
.10 .40 .30 .60
.78 .30 .57 .90
occurrence: values along a column within δ = 0.1.
(5b) p = (cols 1, 3, 4; rows 1, 2, 3, 5) of the array
.80 .50 .59 .50
.60 .10 .60 .55
.75 .60 .35 .57
.10 .40 .30 .60
.78 .30 .57 .58
occurrence: most values along a column within δ = 0.1.

(6) Boolean expression m1 ∧ (m2 ∨ m3); input: the incidence matrix
m1 m2 m3 m4
 1  0  1  0
 1  1  1  1
 1  1  1  0
 1  0  0  1
occurrence: the expression evaluates to TRUE (in rows 1, 2, 3).

4.8 Exercises
Exercise 23 (Nonuniqueness) The chapter discussed a scheme, using re-
dundancy, to trim P , the set of all patterns in the data to produce a unique
set, say P ′ .
Construct a specification of a pattern and an elimination scheme where the
resulting P ′ is not unique.

Hint: Let the pattern p be an interval (l, u) on reals. If two intervals overlap
and at least half of one interval is in the overlap region, then the smaller
interval is eliminated from the set of intervals P . Concrete example:

P = {p1 = (1, 3), p2 = (2, 6), p3 = (4, 12)}

1. Consider p1 and p2 ; p1 is eliminated. Then between p2 and p3 , p2 is


removed. Thus
P ′ = {p3 }.

2. Consider p3 and p2 ; p2 is removed. Then between p1 and p3 , none is


removed. Thus
P ′ = {p1 , p3 }.

Exercise 24 (Trend patterns) Let s be a sequence of real values. Discuss


how the problem can be framed to extract common pattern of (additive) trends
in the data?

Hint: Use forward differences.

Exercise 25 (Nonmaximal partial order motif ) Consider a partial order


motif B on n elements, which captures all the common order information of
the elements across all its occurrences and there are no gaps in the occurrences
(See (3a) of the table in Section 4.7).
Give a specification of a nonmaximal partial order motif.

Hint: A nonmaximal partial order motif is on a subset of the n elements


where the subsets occur without gaps in the input. How can these elements
be identified in the partial order? Chapter 14 discusses partial order motifs.

Exercise 26 (Sequence motif ) Let s be an input sequence defined on an


alphabet Σ. Let
m = σ 1 σ 2 . . . σl ,
where σi ∈ Σ, then
Π(m) = {σ1 , σ2 , . . . , σl }.
m is a sequence motif and at each occurrence this sequence may be interrupted
with at most g elements from

Σ \ Π(m).

For example, occurrence of


m = σ1 σ2 σ3
with gap, g = 2, is shown below:

σ5 σ4 σ1 σ6 σ2 σ4 σ5 σ3 σ2 .

Define redundancy on sequence motifs

m1 , m2 , . . . , mr ,

where for each i 6= j, 1 ≤ i < j ≤ r, the following holds

Π(mi ) ∩ Π(mj ) = ∅.

Hint: Define the binary ⊕ operator. See also Chapter 14.

Exercise 27 (Nontrivial, nonmaximal boolean expressions) See (6) of


the table in Section 4.7 for an example of boolean expression motifs.
1. Note that given an array of n boolean values corresponding to n boolean
variables, there always exists an expression on the n variables that evaluates
to TRUE for exactly this assignment.
Discuss possible specifications for nontrivial boolean expression motifs.
2. Discuss the possible interpretation of a nonmaximal boolean expression.
Hint: 1. See monotone boolean expressions in Chapter 14 2. See redescrip-
tions in Chapter 14. What is the relationship between the two, if any?

Exercise 28 Assuming the following

p = p1 ⊕ p2 ⇔ Lp = Lp1 ∩ Lp2 ,
p = p1 ⊗ p2 ⇔ Lp = Lp1 ∪ Lp2 ,

show that the following statements hold:

1. p1 ⊕ p2 = p2 ⊕ p1 .
2. p1 ⊗ p2 = p2 ⊗ p1 .
3. (p1 ⊕ p2 ) ⊕ p3 = p1 ⊕ (p2 ⊕ p3 ).
4. (p1 ⊗ p2 ) ⊗ p3 = p1 ⊗ (p2 ⊗ p3 ).
5. (p1 ⊗ p2 ) ⊕ p3 = (p1 ⊕ p3 ) ⊗ (p2 ⊕ p3 ).
6. (p1 ⊕ p2 ) ⊗ p3 = (p1 ⊗ p3 ) ⊕ (p2 ⊗ p3 ).

Hint: Use the properties of sets.

Exercise 29 (Operators on string patterns) Consider (1a)-(1c) in the


table of classes of patterns of Section 4.7 where each is a string pattern.
Define operators ⊕ and ⊗ in terms of the pattern descriptions for the three
cases.

Exercise 30 (Homologous sets) Consider (1d) in the table of patterns of


Section 4.7 where the pattern is defined on a string of sets (not just singleton
elements), called multi-sets.

1. What are the issues of prime concern in this scenario?

2. Can the pattern also be defined on sets as characters? What should the
constraints be?

Hint: 1. If we convert these multiset input sequences to multiple sequences as follows:
A|C G G T C|T G  ⇔  { AGGTCG, AGGTTG, CGGTCG, CGGTTG },
how many such multiple sequences result in general? Is there any escape from
this explosion?
2. It is better to allow for multiple sets in patterns, as long as each multi-set
S in the pattern satisfies
S ⊆ Si′ ,
where Si′ is the multi-set in the corresponding position of its ith occurrence
in the input.

Exercise 31 (Nontrivial vs flexibility) Consider the different classes of


patterns shown in Section 4.7. Notice that different definitions of occurrences

on the same (or similar) pattern forms give rise to different pattern specifi-
cations. Usually the more flexible a pattern, the more likely it is to pick up
false-positives. However, biological reality may demand such a flexibility.
Discuss which of the patterns are more nontrivial than the others in each
of the groups below. Why?

1. string patterns (1a)-(1d),

2. permutation patterns (2a)-(2d),

3. partial order motifs (3a)-(3c),

4. bicluster motifs (5a)-(5b).

Exercise 32 (Recombination patterns or haplotype blocks) Patterns


sometimes may not fit into the ‘repeating phenomenon’ framework. Instead
they may be so defined so as to optimize a global function. One such exam-
ple is a haplotype block or sometimes also called the recombination pattern.
Consider a simplified definition of a haplotype block here.
The input, I, is an [n × m] array, where each column j, denotes a SNP
(Single Nucleotide Polymorphism) and each row i denotes a sample or an
individual. The order of the columns represents the order in which the SNPs
appear in a chromosome, although they may not be adjacent.
An interval, [j1 , j2 ], 1 ≤ j1 ≤ j2 ≤ m, is a block, b, represents the submatrix
of I consisting of rows from 1 to n and columns from j1 to j2 , and is written
as
Ib = I[j1 -j2 ] .
Let the length of the block be

l = j2 − j1 + 1.

The block pattern, bp , is defined to be a vector of length l where,

bp [i] = x, where the majority entries in column i of Ib is x,

for 1 ≤ i ≤ l. Given some constant,

0 < α ≤ 1,

a row i in block b is compatible if the Hamming distance (number of differ-


ences) between row i of block b and bp is no more than

α (j2 − j1 + 1).

Given I and some α, the task is to find a minimum number of block patterns
that partition I into blocks. In other words, every column, j, must belong to
exactly one block

1. Does this problem always have a solution? If it does, is it unique?


2. Note that α is defined for a row i. Is it meaningful to define such a
constant for a column j?
3. Is there a necessity to define maximal blocks (patterns)?
4. Is the problem so defined ‘nontrivial’ ? Why?
Hint: 1. When α = 1, it is easy to construct an I that has no solution.
Possible block partitions for an input I and α = 0.5 are shown below.

The input I has nine columns (SNPs) and five rows (samples):
I =
1 0 1 0 1 1 0 0 1
1 0 0 0 1 0 0 0 0
1 0 1 0 1 0 1 1 0
1 1 0 1 1 0 0 0 1
1 0 0 0 1 0 1 1 0
with column majorities (the block pattern entries) 1 0 0 0 1 0 0 0 0.
[The original figure shows four possible block partitions of these nine columns — into three blocks (b1, b2, b3), two blocks, two blocks, and a single block, respectively — each with its corresponding block patterns pb.]

Exercise 33 (Haplogroup patterns) Here is another example where pat-


terns do not fit into the ‘repeating phenomenon’ framework. Consider the
following scenario.
The input, I, is an [n × m] array, where each column j, denotes a SNP
and each row i denotes a sample or an individual.
The task is to find patterns that divide the rows into at most K (possibly
overlapping) groups.
1. Discuss possible formalizations of this problem.
2. Is there a necessity to define maximal patterns?

3. For a ‘nontrivial’ formalization, devise a method to solve the optimiza-


tion task.
4. How do the solutions with K ′ < K groups compare?
Hint: 1. See a possible scheme with the example below.
s1 =0 1 0 1 0 1 1 1 1 0
s2 =1 1 0 1 1 1 0 0 1 1
s3 =0 1 1 1 1 1 0 0 1 0
s4 =1 0 0 0 1 0 0 1 1 1
s5 =1 0 1 0 1 0 0 1 1 1

pattern 1 = * 1 * 1 * 1 * * * * for group {s1 , s2 , s3 }


pattern 2 = * * * * 1 * * * 1 * for group {s2 , s3 , s4 , s5 }
4. This is sometimes known as the issue of model selection.

Exercise 34 (Anti-patterns, forbidden words) The subject of this chap-


ter (and the book) is repeating phenomenon in data. However, a unique or
an absent phenomenon is also of interest sometimes and is called an anti-
pattern. An absent string pattern is also called a forbidden word [BMRS02]
in literature.
The following table shows some characteristic properties of patterns. Note
the use of the idea of ‘minimality’ (instead of maximality).
Discuss the essential difference between patterns and anti-patterns.

               # of occurrences, k   p length property   property

pattern
pattern               ≥ 2               maximal          If p′ is a substring of p,
                                                         then p′ is also a pattern.

unique                = 1               minimal          If p is a substring of p′,
                                                         and p′ occurs in the input,
                                                         then p′ is also unique.
anti-pattern
forbidden             = 0               minimal          If p is a substring of p′,
                                                         then p′ is also forbidden.

Hint: What is the dramatic shift in paradigm as k changes from 0 to 1 to ≥ 2,
and why does it occur?

Comments
While we get down and dirty in the rest of the chapters, where the focus
is on a specific class of patterns, in this chapter we indulge in the poetry of
abstraction. One of the nontrivial ideas in the chapter is that of redundancy
of patterns, which I introduced in [PRF+ 00], by taking a frequentist view
of patterns in strings, rather than the usual combinatorial view. Also, a
nonobvious consequence of this abstraction is the identification of maximality
in permutations as recursive substructures (PQ trees; see Chapter 10).

As an aside, out of curiosity, we take a step back and pick a few definitions of
the word ‘pattern’ from the dictionary:
1. natural or chance marking, configuration, or design: patterns of frost
markings.
2. a combination of qualities, acts, tendencies, etc., forming a consistent
or characteristic arrangement: the behavior patterns of ambitious scien-
tists.
3. anything fashioned or designed to serve as a model or guide for some-
thing to be made: a paper pattern for a dress.
In fact none of these, that I picked at random, actually suggest repetitiveness.
The etymology of the word is equally interesting, confirming a Latin origin,
and I quote from Douglas Harper’s online etymology dictionary:
1324, ‘the original proposed to imitation; the archetype; that which is
to be copied; an exemplar’ [Johnson], from O.Fr. patron, from M.L.
patronus (see patron). Extended sense of ‘decorative design’ first recorded
1582, from earlier sense of a ‘patron’ as a model to be imitated. The
difference in form and sense between patron and pattern wasn’t firm
till 1700s. Meaning ‘model or design in dressmaking’ (especially one of
paper) is first recorded 1792, in Jane Austen. Verb phrase pattern after
‘take as a model’ is from 1878.
Part II

Patterns on Linear Strings


Chapter 5
Modeling the Stream of Life

Hypotheses are what we lack the least.


- attributed to J. H. Poincare

5.1 Introduction
In 1950, when it appeared that the capabilities of computing machines were
boundless, Alan Turing proposed a litmus test of sorts to evaluate a machine’s
intelligence by measuring its capability to perform human-like conversation.
The test with a binary PASS/FAIL result, called the Turing Test, is described
as follows: A human judge engages in a conversation via a text-only channel,
with two parties, one of which is a machine and the other a human. If the
judge is unable to distinguish the machine from the human, then the machine
passes the Turing Test.
In the same spirit, can we produce a stream or string of nucleotides ‘in-
distinguishable’ from a human DNA fragment (with the judge of this test
possibly being a bacterium or a virus)?
In this chapter we discuss the problem of modeling strings: DNA, RNA
or protein sequences. We use basic statistical ideas without worrying about
structures or functions that these biological sequences may imply. Such mod-
eling is not simply for amusing a bacterium or a virus but for producing in-silico
sample fragments for various studies.1

5.2 Modeling a Biopolymer


A biopolymer is a special class of polymer produced by living organisms.
Proteins, DNA and RNA are all examples of biopolymers. The monomer unit
in proteins is an amino acid and in DNA and RNA it is a nucleotide.

1 A quick update on the capabilities of computing machines: More than half a century later,

it is quite a challenge to engage a machine in an ‘intelligent’ conversation.


5.2.1 Repeats in DNA


This is just to remind the reader that DNA displays high repetitiveness.
Here we briefly discuss some classes of repeats seen in the human DNA.
Tandem repeats and variable number tandem repeats (VNTR) in DNA occur
when a sequence of two or more nucleotides is repeated and the repetitions are directly
adjacent to each other. For example,

TTACGTTACGTTACGTTACG

is a tandem repeat where


TTACG

is repeated four times in the sequence. A VNTR is a short nucleotide sequence


ranging from 14-100 base pairs that is organized into clusters of tandem
repeats, usually repeated between 4 and 40 times per occurrence.
Microsatellites consist of repeating units of 1-4 base pairs in length. They
are polymorphic, in the sense that they can be characterized by the number
of repeats. However, they are typically neutral, and can be used as molecular
markers in the field of genetics, including population studies. One common
example is the

    (C A)_n

repeat, where n varies across samples providing different alleles. This repeat
is in fact very frequent in the human genome.
A short tandem repeat (STR) in DNA is yet another class of polymorphisms,
with repeating units of 2-10 base pairs in length where the repeated sequences
are directly adjacent to each other. For example,

    (C A T G)_n

is an STR. These are usually seen in the noncoding intron regions, hence in
'junk' DNA. The (A C) repeat is seen on the Y chromosome.
Short interspersed nuclear elements (SINEs) and Long Interspersed Ele-
ments (LINEs) are present in great numbers in many eukaryote genomes.
They are repeated and dispersed throughout the genome sequences. They
constitute more than 20% of the genome of humans and other mammals. The
most famous SINE family are the Alu repeats typical of the human genome.
SINEs are very useful as markers for phylogenetic analysis whereas STRs and
VNTRs are useful in forensic analysis.
Moral of the story: Although this section discussed the human DNA, re-
peats are known to exist in nonhuman DNA including that of bacteria and
archaea. Thus DNA in a sense is highly nonuniform and it is perhaps best to
model segments of the DNA separately, rather than use a 'single-size-fits-all' model.

5.2.2 Directionality of biopolymers


DNA, RNA and even protein sequences can be viewed as linear strings.
Nature also endows these polymers with an orientation, i.e., either a left-to-right
or a right-to-left order.
The backbone of the DNA strand is made from alternating phosphate and
sugar residues. The sugars are joined together by phosphate groups that form
phosphodiester bonds between the third and fifth carbon atoms of adjacent
sugar rings. It is these asymmetric bonds that give a strand of DNA its
direction. The asymmetric ends of a strand of DNA are referred to as the 5’
end and the 3’ end.
The protein polymer is built from twenty different amino acids. Due to
the chemical structure of the individual amino acids, the protein chain has
directionality. The end of the protein with a free carboxyl group is known as
the C-terminus (carboxy terminus), and the end with a free amino group is
known as the N-terminus (amino terminus).
Moral of the story: The directionality in the biopolymers, in a sense, makes
the modeling a simpler task, i.e., they can be viewed as left-to-right strings.
Contrast this with the scenario of aligning multiple double stranded DNA
segments, marked only by the restriction enzyme cut sites, in the absence of
the direction information [AMS97].2

Back to modeling. Having noted the nature of some of the biopolymers, we


get down to the task of modeling a generic biopolymer under the most simplistic
assumptions.
Consider the task of producing an n-nucleotide DNA fragment. Here we
make our first assumption about the underlying model that produces this
DNA fragment.
Independent & Identical: One of the simplest models is the i.i.d. model: a
sequence is independent and identically distributed (i.i.d.)3 if each element
of the sequence has the same probability distribution as the others and all are
mutually independent. Let

prA , prC , prG , prT

be the probabilities of occurrence of the nucleotide base A, C, G and T re-


spectively. Since these are the only bases possible in the DNA fragment,

prA + prC + prG + prT = 1.

2 Optical mapping is a single molecule method and the reader is directed to [SCH+ 97, Par98,

CJI+ 98, Par99, PG99] for an introduction to the methods and the associated computational
problems.
3 A statistician’s love for alliteration is borne out by the use of terms such as iid and MCMC

(Markov Chain Monte Carlo).



5.2.2.1 Random number generator


A random number generator is designed to generate a sequence of num-
bers that lack any pattern. In other words the sequence is ‘random’. Most
computer programming languages include functions or library routines that
perform the task of a random number generator. These routines usually pro-
vide an integer or a floating point (real) number uniformly distributed over
an interval.
For ease of exposition, in the rest of the chapter, assume the availability of a
function RANDOM(·) that returns a random integer r, uniformly distributed
over the interval
[0 . . . 1000).
In other words,
0 ≤ r = RANDOM(·) < 1000,
and each integer
0 ≤ r < 1000,
has the same probability of being picked by the function RANDOM(·).

Back to the model. Under a further simplifying assumption, that the


chance of occurrence of each nucleotide is the same, we have,
    prA = prC = prG = prT = 1/4.

Then each s[i] is generated as follows:

    s[i] = A   if 0 ≤ r < 250,
           C   if 250 ≤ r < 500,
           G   if 500 ≤ r < 750,
           T   otherwise.

Nonequiprobable case. It is easy to modify this to a more general sce-


nario where the probability of occurrence of each character (nucleotide) is not
necessarily the same but
prA + prC + prG + prT = 1.
Let
rA = 1000 prA ,
rC = rA + 1000 prC ,
rG = rC + 1000 prG .
Then

    s[i] = A   if 0 ≤ r < rA ,
           C   if rA ≤ r < rC ,
           G   if rC ≤ r < rG ,
           T   otherwise.


A general scenario. This can be further generalized to an alphabet of size


N where each character
aj , 1 ≤ j ≤ N,
occurs with probability praj and

    Σ_{j=1}^{N} pr_aj = 1.

Assuming (see Section 5.2.2.1)

0 ≤ r = RANDOM(·) < 1000,

let

r0 = 0,
rj = rj−1 + 1000 praj , for each 0 < j ≤ N .

Then, s[i] is constructed as follows:

s[i] = aj , if rj−1 ≤ r < rj , (5.1)

where r is the random integer picked by function RANDOM(·). It can be


verified that for
0 ≤ r < 1000,
there always exists a j such that the following holds:

rj−1 ≤ r < rj .

We leave the proof of this statement as an exercise for the reader (Exercise 35).
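For concreteness, here is a minimal Python sketch of this construction (Equation (5.1)). The helper RANDOM() below is an assumed stand-in for the random number generator of Section 5.2.2.1, and the alphabet, probabilities and fragment length are illustrative choices, not prescribed values.

```python
# A minimal sketch of the i.i.d. generator of Equation (5.1).
import random

def RANDOM():
    return random.randrange(1000)          # uniform integer in [0, 1000)

def iid_fragment(alphabet, probs, n):
    """Generate an n-character string; character j is drawn with probability probs[j]."""
    # Cumulative thresholds r_1, ..., r_N as in Equation (5.1).
    thresholds, acc = [], 0.0
    for p in probs:
        acc += 1000 * p
        thresholds.append(acc)
    s = []
    for _ in range(n):
        r = RANDOM()
        # Find the unique j with r_{j-1} <= r < r_j (fallback guards rounding).
        j = next((k for k, t in enumerate(thresholds) if r < t), len(alphabet) - 1)
        s.append(alphabet[j])
    return "".join(s)

# Example: equiprobable DNA alphabet, a 30-nucleotide fragment.
print(iid_fragment("ACGT", [0.25, 0.25, 0.25, 0.25], 30))
```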

5.2.3 Modeling a random permutation


A permutation is an arrangement of the members (elements) of a finite
alphabet Σ where each element appears exactly once. The alphabet Σ could
be a cluster of genes that is common to two different organisms (see Chapter 11
for details).
Given a finite alphabet Σ, what is meant by a random permutation of size
|Σ|?
We give a precise meaning to this term in two different ways. The fact that
the two models are equivalent is left as an exercise for the reader (Exercise 38).

Model 1 (Sample-space model). Recall the definitions of the terms used


below from Section 3.2.1.1.

1. The sample space, Ω, is the collection of all possible permutations of Σ. In


other words,
Ω = {s | s is a permutation of the elements of Σ}.

2. An event is the outcome of a random trial, which in this case is the


process of drawing exactly one permutation from the sample space.
3. It is easy to see that
|Ω| = |Σ|!.
Each permutation s ∈ Ω has an equal chance of being drawn. Thus the
probability mass function, MP (s), is defined as follows. For each s ∈ Ω,
    MP (s) = 1 / |Σ|!.

Notice that

    Σ_{s∈Ω} MP (s) = 1.

Having set the stage, a random permutation is the outcome of a random trial
of the probability space defined above.

Model 2 (Constructive model). Yet another way of viewing a random


permutation is as follows. A permutation s is produced under the following
scheme. Assume an urn with |Σ| balls, each labeled by a distinct element
σi ∈ Σ.
Each ball has an equal chance of being drawn from the urn. At each iteration,
a ball is drawn from the urn and its label is noted. Note that this ball is not
replaced in the urn and the size of the urn reduces by 1 at each iteration.
How do we generate a random permutation using the RANDOM(·) (see
Section 5.2.2.1) function? Recall
0 ≤ r = RANDOM(·) < 1000.
We construct the permutation s in |Σ| iterations, one position per iteration.
At each iteration k, 1 ≤ k ≤ |Σ|, s[k] is constructed as described below.
1. Initialize Σ1 = Σ.
2. At iteration k, 1 ≤ k ≤ |Σ|,
   (a) Write
           Σk = {σ1 , σ2 , . . . , σj , . . . , σNk },
       where
           Nk = |Σk | = |Σ| − (k − 1).

   (b) For 1 ≤ j ≤ Nk , define the following:

           r0 = 0,
           pr(σj ∈ Σk ) = 1 / |Σk |,
           rj = rj−1 + 1000 pr(σj ∈ Σk ).

   (c) Then, s[k] is constructed as follows:
       i. Let r be the integer returned by RANDOM(·).
       ii. If rj−1 ≤ r < rj , then s[k] is set to σj .
   (d) Then set Σk+1 = Σk \ {s[k]} and go to Step 2 for the next iteration.

Wrapping up. The second model provides a way of constructing (or sim-
ulating) a random permutation, which has the precise property of the first
model.
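A minimal Python sketch of this constructive (urn) model follows. RANDOM() is again an assumed stand-in for the generator of Section 5.2.2.1, and the small gene-cluster alphabet at the bottom is purely illustrative.

```python
# A minimal sketch of the urn model of a random permutation.
import random

def RANDOM():
    return random.randrange(1000)          # uniform integer in [0, 1000)

def random_permutation(sigma):
    """Draw a permutation of the finite alphabet `sigma` using the urn scheme."""
    urn = list(sigma)                       # Sigma_1 = Sigma
    s = []
    while urn:
        r = RANDOM()
        # Equal probability 1/|Sigma_k| per remaining element; pick the unique j
        # with r_{j-1} <= r < r_j, where r_j = j * 1000/|Sigma_k|.
        width = 1000 / len(urn)
        j = min(int(r // width), len(urn) - 1)   # guard against float rounding
        s.append(urn.pop(j))                      # the ball is not replaced
    return s

# Example: a random ordering of a small (hypothetical) gene cluster.
print(random_permutation(["g1", "g2", "g3", "g4", "g5"]))
```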

5.2.4 Modeling a random string


Given
1. a finite alphabet Σ and
2. an integer n,
what is meant by a random string of size n?
This is an overloaded term and we explore its meaning here. As before, we
give a precise meaning to this term in two different ways. The fact that the
two models are equivalent is left as an exercise for the reader (Exercise 39).

Model 1 (Sample-space model). Recall the definitions of the terms used


below from Section 3.2.1.1.
1. The sample space, Ω, is the collection of all possible strings on Σ of length
n each. In other words,

Ω = {s | s is a string on Σ of length n}.

2. An event is the outcome of a random trial, which in this case is the


process of drawing exactly one string from the sample space.
3. It is easy to see that
|Ω| = |Σ|n .
Each string s ∈ Ω has an equal chance of being drawn. Thus the proba-
bility mass function, MP (s), is defined as follows. For each s ∈ Ω,
    MP (s) = 1 / |Σ|^n.

Notice that

    Σ_{s∈Ω} MP (s) = 1.

Having set the stage, a random string is the outcome of a random trial of the
probability space defined above.

Model 2 (Constructive model). Yet another way of viewing a random


string is as follows. A string s produced under the following scheme,

1. i.i.d. model and

2. equi-probable alphabet, i.e.,

prai = praj , for all i and j,

and

    Σ_i pr_ai = 1,

is a random string. This construction has already been discussed in the earlier
part of this section.

Wrapping up. The second model provides a way of constructing (or simu-
lating) a random string, which has the precise property of the first model.
In a sense the random string is the simplest model and other more sophis-
ticated schemes, that are possibly better models of biological sequences, are
discussed in the following sections.

5.3 Bernoulli Scheme


A stochastic process is simply a random function. Usually, the domain over
which the function is defined is a time interval (referred to as a time series
in applications). In our case the domain is a line, each integral point being
referred to as a position. A stationary process is a stochastic process whose
probability distribution at a fixed time or position is the same for all times or
positions.
The Bernoulli scheme is a sequence of independent random variables

X1 , X2 , X3 , . . .

where each random variable is a generalized Bernoulli trial (see Section 3.2.4),
i.e., it takes one possible value from a finite set of states S, with outcome

xi ∈ S occurring with probability pri such that


    Σ_{xi∈S} pr_i = 1.

In fact the model discussed in Section 5.2 is a Bernoulli scheme with n


random variables where n is the length of the fragment under construction.

5.4 Markov Chain


A Markov chain is a sequence of random variables

X1 , X2 , X3 , . . .

that satisfy the Markov property:

P (Xt+1 =x|X0 =x0 , X1 =x1 , . . . , Xt =xt ) = P (Xt+1 =x|Xt =xt ) (5.2)

The possible values of Xi form a countable set S, called the state space
of the chain. We will concern ourselves only with finite S (often written as
|S| < ∞). Also, by Bayes' rule (Equation (3.4)) together with the Markov
property, the joint probability factorizes as:

    P (X0 =x0 , X1 =x1 , . . . , Xt =xt , Xt+1 =x)
        = P (X0 =x0 ) P (X1 =x1 |X0 =x0 ) . . . P (Xt+1 =x|Xt =xt )


In other words, the Markov property states that:
In a Markov chain, given the present state, the future states are independent
of the past states.
Hence, Markov chains can be described by a directed graph with |S| vertices
or nodes where
(1) each vertex is labeled by a state described by a one-to-one mapping
M : S → V , and
(2) the edges are labeled by the probabilities of transitions from one state
to the other states, called the transition probabilities.
(3) Further, since the weights on the edge labels are interpreted as proba-
bilities, the sum of the weights of the outgoing edges, for each vertex,
must add up to 1.
Such a graph is also referred to as a finite state machine (FSM). See Fig-
ure 5.1(a) for a simple example.
Yet another way of stating the Markov property is as follows:

Markov process (S, P) with S = {C, G}

[Figure 5.1(a) shows the finite state machine on the two states C and G;
(b) shows the corresponding transition probability matrix P:

              C     G
    P:  C  [ 0.8   0.2 ]
        G  [ 0.8   0.2 ]

(c) An observable from this process: C G G G C C G G G C.]

FIGURE 5.1: (a) Specification of a Markov process: A finite state ma-
chine (or a directed graph) with two states given as S = {C, G}. Each
transition (edge) is labeled with the probability of going from one state to
another. (b) The corresponding transition probability matrix P.

If the FSM is in state x at time t, then the probability that it moves


to state y at time t + 1 depends only on the current state and does not
depend on the time t.
The FSM can also be represented by a matrix (see Figure 5.1(b)), called the
transition (probability) matrix usually denoted as P. We follow the convention
that
P[i, j]
represents the probability of going from state xi to state xj , also written as:
P[i, j] = P (Xt+1 =xj |Xt =xi ).
Thus matrix P has the following properties:
1. P[i, j] ≥ 0 for each i and j since it represents a probability. In other
words P is a nonnegative matrix.
2. The entries in each row must add up to 1. In other words, the row vector
is a probability vector. We call such a matrix a stochastic matrix.
A vector with nonnegative values whose entries add up to 1 is called a
stochastic vector or simply stochastic.
Stochastic matrix P is a transition matrix for a single step and the k-step
transition probability can be computed as the kth power of the transition
matrix, Pk .
Any stochastic matrix can be considered as the transition matrix of some
finite Markov chain. When all rows as well as all columns are probability
vectors, we call such a matrix doubly stochastic. To summarize, we specify a
Markov process as the tuple
(S, P), (5.3)
Modeling the Stream of Life 123

where

1. S is a finite set of states and

2. P is a stochastic matrix.

This is also conveniently represented by the FSM (graph) G.
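As an illustration, a small Python sketch that generates an observable from a Markov process (S, P) is given below. The two-state alphabet, transition values and start state are illustrative assumptions, not prescribed data.

```python
# A minimal sketch of sampling an observable string from a Markov process (S, P).
import random

def sample_markov_chain(states, P, start, n):
    """Generate a length-n string; P[i][j] = probability of moving from state i to state j."""
    idx = {s: i for i, s in enumerate(states)}
    out = [start]
    for _ in range(n - 1):
        row = P[idx[out[-1]]]                          # current row of P
        out.append(random.choices(states, weights=row)[0])
    return "".join(out)

# Illustrative two-state process over {C, G}.
states = ["C", "G"]
P = [[0.8, 0.2],    # from C
     [0.8, 0.2]]    # from G
print(sample_markov_chain(states, P, start="C", n=10))
```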

5.4.1 Stationary distribution


We next discuss an important concept associated with Markov chains, which
is the idea of the probability vector of the state space.
For example consider the Markov process shown in Figure 5.1. A possible
string of length 10 generated by this process is

C G G G C C G G G C.

In fact the process can generate all possible strings of C and G. Then, how
is this different from another Markov process with the same state space S,
capable of generating all possible strings of C and G, but with a different
transition matrix P?
So a natural question arises from this:

Given the specification of a Markov process (say as an FSM), what can


be said about the probability of the occurrences of each state (C, G in
our example) in sufficiently long strings generated by this process?

Mathematically speaking, we seek a probability row vector π that satisfies


the following:
π = πP. (5.4)

π is called the stationary distribution.4 The two following questions are of


paramount interest.

1. (Question 1): Does π always exist?

2. (Question 2): If it does, is it unique?

We address the two questions in the discussion below.

Question 1 (Existence). Consider a state space

S = {A,T,C,G},

4 Note that π is a left eigenvector of P associated with the eigenvalue 1.



and the following stochastic matrix, which is the transition probability matrix
of a Markov process with

                 A      T      C      G
          A  [  0.3    0.3    0.4    0    ]
    P  =  T  [  0      0.4    0      0.6  ]                    (5.5)
          C  [  0.25   0.25   0.25   0.25 ]
          G  [  0      0.5    0      0.5  ]
It turns out that there is no strictly positive π (i.e., all entries of the vector positive)
satisfying
πP = π
for this P. In matrix theory, we say that P is reducible if by a permutation of
rows (and corresponding columns) it can be reduced to the form

    P′ = [ P11  P12 ]
         [ 0    P22 ],
where P11 and P22 are square matrices. If this reduction is not possible, then
P is called irreducible. The following theorem gives the condition(s) under
which a matrix has such a solution.

THEOREM 5.1
(Generalized Perron-Frobenius theorem) A stochastic irreducible matrix
P has a strictly positive solution (i.e., all entries of π are positive) to Equation (5.4).

The proof of this theorem is outside the scope of this book and we omit it.
It turns out that P of Equation (5.5) is not irreducible, since the following
holds:

                  A      C      G      T
           A  [  0.3    0.4    0      0.3  ]
    P′  =  C  [  0.25   0.25   0.25   0.25 ]
           G  [  0      0      0.5    0.5  ]
           T  [  0      0      0.6    0.4  ]

which has the block form [ P11  P12 ; 0  P22 ] with the states reordered as
A, C, G, T.
Given a matrix, is there a simple algorithm to check if a matrix is irreducible?
We resort to algebraic graph theory for a simple algorithm to this problem.
We have already seen that P is the edge weight matrix of a directed graph
with S as the vertex set (also called the FSM earlier in the discussion).
A directed graph is strongly connected if there is a directed path from xi to
xj for each xi and xj in S. See Figure 5.2 for an example. The following
theorem gives the condition under which a matrix is irreducible.

THEOREM 5.2
(Irreducible matrix theorem) The adjacency matrix of a strongly con-
nected graph is irreducible.

Thus to check if a directed graph is strongly connected, all we need to see


is that every vertex is reachable by every other.
A depth first search (DFS) is a linear time,

O(|V | + |E|),

recursive algorithm to traverse a graph. It turns out that we can check if the
directed graph is strongly connected by doing the following.

1. Fix a vertex v ′ ∈ V .

2. (a) Carry out a DFS in G from v ′ and


(b) check that all vertices were traversed.
If there exists a vertex that cannot be traversed, then the graph is not
strongly connected.

3. Transpose the graph, i.e., reverse the direction of every edge in the graph
to obtain GT .

(a) Carry out a DFS in GT from v ′ and


(b) check that all vertices were traversed.

If not, then the graph is not strongly connected.

Thus we can check for strongly connectedness of the graph in just two DFS
traversals i.e., in
O(|V | + |E|)
time. Why is the algorithm correct?

• The first DFS traversal on the graph G initiated at v ′ ensures that every
  other vertex is reachable from v ′ .

• The second DFS traversal on the transpose of the graph GT initiated at
  v ′ ensures that v ′ is reachable from every other vertex.

• Hence, for any two vertices u and w, there is a path from u to v ′ to w,
  and vice versa.

This algorithm is a special case of the Kosaraju Algorithm.

Algorithm 4 DFS Traversal Algorithm


DFS(v)
Mark v
For each unmarked neighbor u of v
DFS(u)

[Figure 5.2 shows two small graphs on the vertices C, G, A, U, T:
(a) Undirected, connected. (b) Directed, strongly connected.]

FIGURE 5.2: Examples of connected and strongly connected graphs.

Algorithm 5 Strongly Connected Check


DFS(G, v)
IF all vertices traversed
THEN DFS(GT , v)
IF all vertices traversed
THEN StronglyConnected
ELSE notStronglyConnected
ELSE notStronglyConnected
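A minimal Python sketch of this two-DFS check (Algorithms 4 and 5) is given below. The adjacency-list representation (every vertex appears as a key) and the small example graph are assumptions made for illustration.

```python
# A minimal sketch of the strong-connectivity check via two DFS traversals.
def dfs(graph, v, visited):
    visited.add(v)
    for u in graph.get(v, []):
        if u not in visited:
            dfs(graph, u, visited)

def is_strongly_connected(graph):
    vertices = list(graph)
    if not vertices:
        return True
    v0 = vertices[0]                        # fix a vertex v'
    # First DFS: is every vertex reachable from v'?
    seen = set()
    dfs(graph, v0, seen)
    if seen != set(vertices):
        return False
    # Transpose the graph and repeat: is v' reachable from every vertex?
    gt = {v: [] for v in vertices}
    for v in vertices:
        for u in graph[v]:
            gt.setdefault(u, []).append(v)
    seen = set()
    dfs(gt, v0, seen)
    return seen == set(vertices)

# Example: an FSM in which every state can reach every other state.
g = {"A": ["C", "G"], "C": ["A"], "G": ["A"]}
print(is_strongly_connected(g))   # True
```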

Question 2 (Uniqueness). Next, it is equally important to check if the


solution to Equation (5.4) is unique.

THEOREM 5.3
(Irreducible matrix theorem) A stochastic, irreducible P has a unique
solution to Equation (5.4).

The proof of this theorem is outside the scope of this book and we omit it.
Before concluding the discussion, we ask another question: Does it matter
what state we start the Markov process with? Consider the following transition
matrix:

    P = [ 0  1 ]
        [ 1  0 ]
Note that P is stochastic and irreducible, hence the solution π is unique and
is given below:

    [1/2  1/2] [ 0  1 ]  =  [1/2  1/2]
               [ 1  0 ]
However,

    P² = [ 0  1 ]²  =  [ 1  0 ]  =  Id
         [ 1  0 ]       [ 0  1 ]

Thus

    P^k = Id   for k even,
          P    for k odd.
Since
[1/2 1/2]
is the unique nonnegative left eigenvector, it is clear that

    lim_{k→∞} v · P^k

does not exist unless

    v = [1/2  1/2].
Hence we need to have another constraint on P so that

    lim_{k→∞} v · P^k = π

for every stochastic v.


A sufficient condition for this to happen is that P is primitive, i.e., Pk
is positive for some k. Under these conditions, the Perron-Frobenius theo-
rem states that Pk converges to a rank-one matrix in which each row is the
stationary distribution π, that is,

    lim_{k→∞} P^k = 1π

where 1 is the column vector with all entries equal to 1. In other words,
With large t, the Markov chain forgets its initial distribution and con-
verges to its stationary distribution.

5.4.2 Computing probabilities


Consider the following stochastic matrix:

                 A      C      G      T
          A  [  1/4    1/4    1/4    1/4 ]
    P  =  C  [  1/6    1/6    1/3    1/3 ]
          G  [  1/4    1/4    1/4    1/4 ]
          T  [  1/6    1/6    1/3    1/3 ]

We can show that the FSM graph is strongly connected, hence the matrix is
irreducible.
Another easy check is that since the matrix has no zero entry, the Markov
process is trivially irreducible and has a unique solution to Equation (5.4),
given as:
    π = [5/24  5/24  7/24  7/24]   (in the order A, C, G, T).

Now, we are ready to compute the probability, P (x), associated with string
x, where
x = x0 x1 x2 . . . xn ,
for a Markov process specified by P. This is given as follows:
    P (x) = π(x0) Π_{i=1}^{n} P (xi |xi−1)
          = π(x0) Π_{i=1}^{n} P[xi−1 , xi ].

As a concrete example, let


x = CCGA.
Then,
    P (x) = π(C) P (C|C) P (G|C) P (A|G)
          = π(C) P[2, 2] P[2, 3] P[3, 1]
          = (5/24) (1/6) (1/3) (1/4)
          = 5/1728.
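The computation above is easy to mechanize. The following Python sketch evaluates P(x) for this Markov process using exact fractions; the matrix, the stationary vector and the test string are the ones from the worked example.

```python
# A minimal sketch of computing P(x) for a Markov process (S, P) with stationary pi.
from fractions import Fraction as F

states = "ACGT"
P = [[F(1,4), F(1,4), F(1,4), F(1,4)],   # from A
     [F(1,6), F(1,6), F(1,3), F(1,3)],   # from C
     [F(1,4), F(1,4), F(1,4), F(1,4)],   # from G
     [F(1,6), F(1,6), F(1,3), F(1,3)]]   # from T
pi = [F(5,24), F(5,24), F(7,24), F(7,24)]

def markov_prob(x):
    """P(x) = pi(x0) * prod_i P[x_{i-1}, x_i]."""
    idx = [states.index(c) for c in x]
    p = pi[idx[0]]
    for a, b in zip(idx, idx[1:]):
        p *= P[a][b]
    return p

# Sanity check: pi is stationary (pi P = pi) ...
assert all(sum(pi[i] * P[i][j] for i in range(4)) == pi[j] for j in range(4))
# ... and the worked example gives 5/1728.
print(markov_prob("CCGA"))   # 5/1728
```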

5.5 Hidden Markov Model (HMM)


A natural next step in modeling a biological sequence is a Hidden Markov
Model (or HMM) which has an underlying hidden (unobserved) state process
that follows a Markov chain.
For example, it has been found that there is perhaps some correlation between
the bending propensity, or bendability, of a DNA segment and the promoter
regions of eukaryotic genes. Similarly, CG-rich or CG-poor is a compositional
property associated with similar DNA functions (see Exercise 44). Observa-
tions such as these could influence the process of generating a DNA sequence
and they can be captured using ‘hidden states’.
A Hidden Markov Model (Σ, S, P, E) is defined as follows:
1. A finite alphabet Σ,
2. a Markov process (S, P) (see Eqn (5.3)) where S is called the set of
hidden states, and
3. an emission matrix E. Each state in S is associated with a probability
vector, called the emission vector, on Σ. E is a
|S| × |Σ|

Hidden Markov Model (Σ, S, P, E) with Σ = {A, T}, S = {C, G}

[Figure 5.3(a) shows the finite state machine on the hidden states C and G,
with dashed edges for the emissions; (b) and (c) show the transition and
emission matrices:

              C     G                      A     T
    P:  C  [ 0.8   0.2 ]        E:  C  [  0.6   0.4 ]
        G  [ 0.8   0.2 ]            G  [  0.5   0.5 ]

(d) A path in the HMM: C G C C G G G G C G C.
(e) An observable from this process: A T T A T T T T A A A.]

FIGURE 5.3: (a) Specification of a Hidden Markov process with a finite
state machine (or a directed graph) whose states are the elements of S.
The dashed edges represent the emission vectors for each state. (b) The
transition probability matrix and (c) the emission matrix.

stochastic matrix where each row is an emission vector.


A path in HMM,
z = z0 z1 z2 . . . zn ,
is a sequence of states in S and a string,
x = x0 x1 x2 . . . xn ,
is a sequence of characters in alphabet Σ. Recall that
(S, P)
is a Markov process and for an irreducible P there exists a unique stationary
distribution π. Then the probability of x given path z is given as follows:
n
Y
P (x|z) = π(z0 ) P (x0 |z0 ) P (zi |zi−1 )P (xi |zi ).
i=1

By our convention,
P (xi |zi ) = E[zi , zi ]
and
P (zi+1 |zi ) = P[zi , zi+1 ].
Then we get the following:
n
Y
P (x|z) = π(z0 ) E[z0 , x0 ] P[zi , zi−1 ]E[zi , xi ].
i=1

See Figure 5.3 for an example.
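A small Python sketch of this computation is given below. The two-state model, its transition and emission values, and the stationary vector are illustrative assumptions consistent with the sketch above, not prescribed data.

```python
# A minimal sketch of evaluating P(x|z) for an HMM (Sigma, S, P, E).
states   = ["C", "G"]          # hidden states S
alphabet = ["A", "T"]          # emitted symbols Sigma
P  = [[0.8, 0.2], [0.8, 0.2]]  # transition matrix (rows sum to 1)
E  = [[0.6, 0.4], [0.5, 0.5]]  # emission matrix  (rows sum to 1)
pi = [0.8, 0.2]                # stationary distribution of this particular P

def prob_x_given_z(x, z):
    """P(x|z) = pi(z0) E[z0,x0] * prod_i P[z_{i-1},z_i] E[z_i,x_i]."""
    si = [states.index(c) for c in z]
    ai = [alphabet.index(c) for c in x]
    p = pi[si[0]] * E[si[0]][ai[0]]
    for t in range(1, len(x)):
        p *= P[si[t - 1]][si[t]] * E[si[t]][ai[t]]
    return p

print(prob_x_given_z("ATTA", "CGCC"))
```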



Why use HMMs? The primary reason for using a Hidden Markov Model
is to model an underlying ‘hidden’ basis for seeing a particular sequence of
observation x. This basis, under the HMM model, is the path z.

5.5.1 The decoding problem (Viterbi algorithm)


The output or observation of an HMM is the string x (the sequence of
emitted symbols from Σ). Then the possible questions of interest are:
1. What is the most likely explanation of x?
The explanation under this model is a plausible HMM path z. So the
question is rephrased as finding an optimal path z ∗ for a given x such
that
P (x|z)
is maximized, or,

    z* = arg max_z P (x|z).

2. What is
P (x|z ∗ ),
the probability of observing x, given z ∗ ?
3. What is
P (x),
the (overall) probability of observing x?
The first and the second question are obviously related and are answered
simultaneously by using the Viterbi Algorithm. The third question is left as
an exercise for the reader (see Exercise 41).
We address the first question here. The algorithm for this problem was
originally proposed by Andrew Viterbi as an error-correction scheme for noisy
digital communication links [Vit67], hence it is also sometimes called a decoder.

LEMMA 5.1
(The decoder optimal subproblem lemma) Given an HMM,

(Σ, S, P, E),

and a string,
x = x1 x2 . . . xn ,
on Σ, consider
x1 x2 . . . xi ,
the i-length prefix of x. For a state

zj ∈ S

let the path ending at state zj given as

zj1 zj2 . . . zji

be optimal, i.e.,

    zj1 zj2 . . . zji = arg max_{z1 z2 ...zi} P (x1 x2 . . . xi |z1 z2 . . . zi ).

Let
fij
be the probability associated with string

x1 x2 . . . xi ,

given the optimal path ending at zj written as:

fij = P (x1 x2 . . . xi |zj1 zj2 . . . zji ).

Recall
zj = zji .
Then the following two statements hold:

    fij = max_k ( f(i−1)k P (zj |zk ) P (xi |zj ) )                (5.6)

and
f1j = π(zj ) P (x1 |zj ). (5.7)
Also, let the optimal path ending at zj be written as

Path(zj ) = zj1 zj2 . . . zji .

Further, the i-length prefix of the optimal path ending at zj is obtained as

    Path(zk′ ) zj ,

where

    k′ = arg max_k ( f(i−1)k P (zj |zk ) P (xi |zj ) ).

In spite of its intimidating looks, the lemma is very simple and we leave
the proof as an exercise for the reader (Exercise 43). In a nutshell, the lemma
states simply that the optimal solution for

    x1 x2 . . . xn

can be built from the optimal solutions for its prefixes

    x1 x2 . . . xi .
We now describe the algorithm below based on this observation. Let x of
length n be the input string. Let Fi be a
1 × |S|
matrix where each
Fi [j]
stores fij of Lemma (5.1). Similarly,
P thi [j]
stores the corresponding path.
1. In the first phase, array F1 and P th1 are initialized.
2. In the second phase, where the algorithm loops, array Fi and P thi are
updated based on the observation in the lemma.
3. In the final phase, the algorithm goes over Fn [k] for each k and picks
the maximum value.
The algorithm takes
O(n |S|²)
time. However, due to the underlying Markov process, Fi+1 depends only on
Fi and not on
Fi′ , i′ < i.
Thus only two arrays, say F0 and F1 , are adequate and the arrays can be
re-used in the algorithm. Thus the algorithm requires only
O(|S|)
extra space to run.

Algorithm 6 Viterbi Algorithm

Input: P, E, π, x; Output: opt, opt-path
// Initialize
For each j
    F1 [j] ← π(j) E[j, x1 ]
    P th1 [j] ← zj
// Main Loop
For i = 2, 3, . . . , n
    For each j
        Fi [j] ← maxk (Fi−1 [k] P[k, j] E[j, xi ])
        k′ ← arg (maxk (Fi−1 [k] P[k, j] E[j, xi ]))
        P thi [j] ← P thi−1 [k′ ] zj
// Finale
opt ← maxk Fn [k]; k′ ← arg (maxk Fn [k])
opt-path ← P thn [k′ ]
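For reference, a minimal Python rendering of Algorithm 6 follows. It stores the full path arrays for clarity (rather than reusing two arrays), and the two-state HMM values at the bottom are illustrative assumptions.

```python
# A minimal sketch of the Viterbi decoder (Algorithm 6).
def viterbi(x, states, alphabet, P, E, pi):
    """Return (max probability, most likely state path) for the observation x."""
    ai = [alphabet.index(c) for c in x]
    n, m = len(x), len(states)
    # F[j] = best probability of a path ending in state j; Pth[j] = that path.
    F = [pi[j] * E[j][ai[0]] for j in range(m)]
    Pth = [[j] for j in range(m)]
    for i in range(1, n):
        newF, newPth = [], []
        for j in range(m):
            k_best = max(range(m), key=lambda k: F[k] * P[k][j])
            newF.append(F[k_best] * P[k_best][j] * E[j][ai[i]])
            newPth.append(Pth[k_best] + [j])
        F, Pth = newF, newPth
    k_best = max(range(m), key=lambda k: F[k])
    return F[k_best], "".join(states[j] for j in Pth[k_best])

# Illustrative two-state HMM.
states, alphabet = ["C", "G"], ["A", "T"]
P  = [[0.8, 0.2], [0.8, 0.2]]
E  = [[0.6, 0.4], [0.5, 0.5]]
pi = [0.8, 0.2]
print(viterbi("ATTA", states, alphabet, P, E, pi))
```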

5.6 Comparison of the Schemes


The Bernoulli scheme is a special case of a Markov chain where the transi-
tion probability matrix has identical rows. In other words, state t + 1 is even
independent of state t, in addition to being independent of all the previous
states.
Under a Bernoulli scheme, all strings s that contain ni occurrences of the
state

    xi , i = 1, 2, . . . , N,

have the same probability, given as

    P (s) = Π_i (pr_i)^{ni}.

For example,

    P (A A T C C G) = P (T C A A C G)
                    = P (T G A A C C)
                    = P (A C T A C G)
                    = P (A C C A T G)
                      ...
                    = (prA )² prT (prC )² prG .

However, a Markov chain model induces a probability distribution on the


strings that can possibly distinguish each of the cases above. A Hidden Markov
Model can even offer a possible explanation for a particular string.

5.7 Conclusion
What scheme best approximates a DNA fragment? What scheme best
approximates a protein sequence? These are difficult questions to answer
satisfactorily. It is best to understand the problem at hand and use a scheme
that is simple enough to be tractable and complex enough to be realistic.
Of course, it is unclear whether a more complex scheme is indeed a closer
approximation to reality. The design of an appropriate model for a biological
sequence continues to be a hot area of research.

5.8 Exercises
Exercise 35 (Constructive model) Let

pra1 , pra2 , . . . , praN

be real values such that


    Σ_{j=1}^{N} pr_aj = 1,

and let

r0 = 0,
rj = rj−1 + 1000 praj , for each 0 < j ≤ N .

Then show that if


0 ≤ r < 1000,
then there always exists a unique 1 ≤ j ≤ N such that the following holds:

rj−1 ≤ r < rj .

Exercise 36 (Random number generator) Let a random number gen-


erator, RAND(), return a floating point (real) number uniformly distributed
between 0 and 1. Modify Equation (5.1) to construct the sequence s.

Exercise 37 ∗∗ (Pseudorandom generator) There are mainly two methods


to generate random numbers.
The first measures some random physical phenomenon (with some corrections
to neutralize measurement biases).
The second uses a computational algorithm that produces long sequences of
(apparently) random numbers. The second method is called a pseudo-random
number generator and uses an initial seed.
1. Design a pseudo-random number generator.
2. Design a pseudo-random permutation generator.
Hint: Investigate statistical randomness and algorithmic randomness to use
as test for the results. Note that the two problems are topics of current active
research.

Exercise 38 (Random permutation)


1. Argue that in the second (constructive) model of Section 5.2.3,

(a) for a fixed position k the probability of s[k] taking the value σ is

1
,
|Σ|

for each σ ∈ Σ.
(b) for a fixed σ ∈ Σ, the probability of position k taking the value σ is

1
,
|Σ|

for each k.

2. Argue that the two models of a random permutation presented in Sec-


tion 5.2.3 have identical interpretations.

Hint: Note that the sum of two random numbers is random and the product
of two random numbers is random. Show that a random string/permutation
produced by the second model satisfies the properties of the first model.

Exercise 39 (Random string) Argue that the two models of a random


string presented in Section 5.2.4 have identical interpretations.

Hint: Show that a random string/permutation produced by the second model


satisfies the properties of the first model.

Exercise 40 Given an HMM

(Σ, S, P, E),

show that

    Σ_i P (xi ) = 1,

where xi is any observed string of length of fixed length say k and P (xi ) is the
marginal probability of xi i.e.,
    P (xi ) = Σ_j P (xi |zj ),

where zj is an HMM path of length k.

Hint: Assume, for simplicity, |S| = 4, |Σ| = 2 and k = 4.



Exercise 41 (Probability space & HMM) Let a probability space be de-


fined as
(Ω, 2Ω , MP )
where the probability distribution MP is defined by an HMM

(Σ, S, P, E).

Let Ω be the space of all strings on Σ of length k. Show that Kolmogorov's
conditions hold for this probability distribution.

Hint: Show that

1. P (ω ∈ Ω) ≥ 0 and

2. P (Ω) = 1.

Exercise 42 Let the output of an HMM be the string

x = x1 x2 x3 . . . xn

(i.e. a sequence of emitted symbols from Σ). What is the (overall) probability
of observing x under this model?

Hint: Note that there are many HMM paths of length n that emit x. How
many such paths exist?

Exercise 43 Prove Eqns (5.6) and (5.7) in The Decoder Optimal Subproblem
Lemma (5.1).

Hint: Use proof by contradiction.

Exercise 44 Design a bi-variate Hidden Markov Model where

1. one variable models the CG content (composition, as rich or poor) and

2. the other models the bendability of DNA (structure).

What is the probability space


(Ω, F, P )?
What are the issues with this architecture?

Hint: Note that in practice, bendability is usually a real-valued (continuous)


measure.

Exercise 45 Let D be large collection of long sequences of amino acids. The


task is to approximate a Markov Chain process based on D. The twenty states
of the Markov process correspond to the twenty amino acids in D,

ai , 1 ≤ i ≤ 20.

Let
#(ai aj )
denote the number of times the substring ai aj is observed in D.
For each i, the following is computed:
    ni = Σ_j #(ai aj ),
    n = Σ_i ni .

1. The transition probability matrix P is computed as follows:

    P[i, j] = #(ai aj ) / ni .

2. The stationary probability distribution is computed as follows:

    π(ai ) = ni / n.

By this construction, does


πP = π
hold? Why?
Chapter 6
String Pattern Specifications

Keep it simple, but not simpler.


- anonymous

6.1 Introduction
One dimensional data is about the simplest organization of information.
It is amazing that deoxyribonucleic acid (DNA) that contains the genetic in-
structions for the entire development and functioning of living organisms, even
as complex as humans, is only linear. The genome, also called the blueprint of
the organism, encodes the instructions to build other components of the cell,
such as proteins and RNA molecules that eventually make up ‘life’.
As we have seen in Chapter 5 the biopolymers can be modeled as strings
(with a left-to-right direction). In this chapter we explore the problem of spec-
ifying string patterns. We begin with a few examples of patterns in biological
sequences.

Solid patterns. The ROX1 gene encodes a repressor of the hypoxic func-
tions of the yeast Saccharomyces cerevisiae [BLZ93]. The DNA binding motif
is recognized as follows:
CCATTGTTCTC

This pattern can also be extracted from DNA sequences of the transcriptional
factor ROX1.

Rigid patterns (with wild cards). One of the yeast genes required for
growth on galactose is called GAL4. Each of the UAS genes contains one or
more copies of a related 17-bp sequence called UASGAL,

G.CAAAA.CCGC.GGCGG.A.T

Gal4 protein binds to UASGAL sequences and activates transcription from a


nearby promoter. Note the presence of wild cards in the pattern.


Extensible patterns (with homologous sets). Fibronectin is a plasma


protein that binds cell surfaces and various compounds including collagen, fi-
brin, heparin, DNA, and actin. The major part of the sequence of fibronectin
consists of the repetition of three types of domains, which are called type I,
II, and III. Type II domain is approximately forty residues long, contains four
conserved cysteines involved in disulfide bonds and is part of the collagen-
binding region of fibronectin. In fibronectin the type II domain is duplicated.
Type II domains have also been found in various other proteins. The fi-
bronectin type II domain pattern has the form shown below:

C..PF.[FYWI].......C-(8,10)WC....[DNSR][FYW]-(3,5)[FYW].[FYWI]C

Note the presence of homologous characters such as [FYWI] in the motif.


The wild cards are represented as ‘.’ characters and the extensible part of the
pattern is shown as integer intervals: (8,10) is to be interpreted as ‘gap’ of
length 8 to 10 and the interval (3,5) is similarly interpreted.

Back to introduction. As the pattern specification becomes less stringent,
such as allowing for wild cards or permitting flexibility in terms of length,
what happens to the size of the output? We explore this in detail in this
chapter and, in the next chapter, present a generic discovery algorithm for
different classes of patterns.

6.2 Notation
Let the input s be a string of length n over a finite discrete alphabet

Σ = {σ1 , σ2 , . . . , σL },

where |Σ| = L. A substring of s, written as

s[i..j], 1 ≤ i ≤ j ≤ n,

is the string obtained by yanking out the segment from index i to j from s.
A character from the alphabet,

σ ∈ Σ,

is called a solid character. A wild card or a ‘dont care’ is denoted as the ‘.’
character. This is to be interpreted as any character from the alphabet at
that position in the pattern. The size of the pattern is given as

|p |,

which is simply the number of elements in p. However, sometimes the size


may be defined by the area (size) of the segment in the input I that an
occurrence of p covers.
A nontrivial pattern p has the following characteristics.
1. The size of p is larger than 1, i.e., |p |≥ 2.
An element of the alphabet,
σ ∈ Σ,
is a trivial (singleton) pattern.
2. The first and last element of p is solid, i.e.,
p[1], p[ |p |] ∈ Σ.
For example, if
Σ = {A, C, G, T},
then all the following patterns are trivial.
p1 =CTG..
p2 =. C.G.
p3 =. . . . .
p4 =. . CTG

Let P denote the set of all nontrivial patterns in s.


The location list, Lp , of a pattern p is the list of positions in s where p
occurs. If
|Lp | = k,
we also say that the pattern p has a support of k in s. For any pattern p,
|Lp |≥ 2.
Occurrence & operators ⊗ (meet), ⪯. Let q1 and q2 be defined on alphabet
Σ = {C, G, T }.
Consider the three following ‘alignments’ of q1 with q2 . Considering the ele-
ments along each column of this alignment, a disagreement is denoted by a
dont care ‘.’ character, in the last row below. The alignment is denoted by
(j1 , j2 ), which is interpreted as position j1 of q1 is aligned withe position j2
of q2 .
    Alignment I: (1,1)        Alignment II: (2,1)       Alignment III: (3,1)
    q1  G G G T G G G C       G G G T G G G C           G G G T G G G C
    q2  G T T T G C             G T T T G C                 G T T T G C
    p   G . . T G . . .       . G . T . G . .           . . G T . . G C

This is a brief introduction and the details are to be presented for each class
of patterns in their respective sections. For a fixed alignment (j1 , j2 ), we use
the following terminology.
1. p is usually defined by removing the leading and trailing dont care el-
ements, so that its leftmost and rightmost element is a solid character.
Also, let |p |> 1. In the rest, we assume p is nontrivial.
2. p is called the meet of q1 and q2 with alignment (j1 , j2 ). Also,

p = q1 ⊗ q2 , for alignment (j1 , j2 )

3. p occurs at the location(s) dictated by the alignment in q1 and q2 . The


convention followed depends on the context, but could be taken as the
leftmost positions in q1 and q2 of the alignment. Also, we say

p ⪯ q1 and p ⪯ q2 .

Location list Lp . The location list, Lp , of a pattern p ∈ P is the list of


positions in the input s where p occurs. If

|Lp | = k,

we also say that the pattern p has a support of k in s. For any pattern p,

|Lp |≥ 2.

6.3 Solid Patterns


A solid pattern, p, as the name suggests, has only solid characters. Its
occurrence is simply defined as the appearance of p in the input as a
substring. In other words, there exist some

    1 ≤ j ≤ k ≤ n,

such that

    p = s[j..k].
The solid pattern is also known as an l-mer in literature, where |p |= l.
Consider the example shown in Figure 6.1 which gives P and the location
list Lp , for each p ∈ P . It is not difficult to see that the number of nontrivial
solid patterns is
O(n2 ).
We next define maximal patterns, which help remove some repetitive patterns
in P .

s1 = a b c d a b c d a b c a b.

    P = { a b c d a b c,   b c d a b c,   c d a b c,   d a b c,
          a b c d a b,     b c d a b,     c d a b,     d a b,
          a b c d a,       b c d a,       c d a,       d a,
          a b c d,         b c d,         c d,
          a b c,           b c,
          a b }.

    L_{a b c d a b c} = L_{a b c d a b} = L_{a b c d a} = L_{a b c d} = {1, 5}.
    L_{b c d a b c} = L_{b c d a b} = L_{b c d a} = L_{b c d} = {2, 6}.
    L_{c d a b c} = L_{c d a b} = L_{c d a} = L_{c d} = {3, 7}.
    L_{d a b c} = L_{d a b} = L_{d a} = {4, 8}.

    L_{a b c} = {1, 5, 9}.
    L_{b c} = {2, 6, 10}.
    L_{a b} = {1, 5, 9, 12}.

FIGURE 6.1: P is the set of all nontrivial solid patterns on s1 . Lp is


the location list of p. We follow the convention that, i ∈ Lp , is the leftmost
position (index) in s1 of an occurrence of p.

6.3.1 Maximality
A maximal solid pattern p is maximal in length if p cannot be extended
to the left or to the right, without decreasing its support, to get a nontrivial

    p′ ≠ p.

This is indeed the intuitive definition of maximality. It can be shown that


this definition is equivalent to the one in Chapter 4 in terms of its location
list. We leave this as an exercise for the reader (Exercise 46).
Thus pattern
a b c d a b c,
is maximal in s1 but patterns

a b c d a b, a b c d a, a b c d,

are not maximal in s1 since each can be extended to the right with the same
support (i.e., k = 2). By this definition, what are the maximal patterns? Let

Pmaximal (s1 ) ⊆ P (s1 ),

be the set of all maximal patterns. Then

Pmaximal (s1 ) = {a b c d a b c, a b c, a b}.
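The maximality test is easy to check by brute force on small examples. The following Python sketch naively enumerates the nontrivial solid patterns of a string together with their location lists and keeps only those that cannot be extended to the left or right without losing support; it is quadratic and intended only to illustrate the definitions on examples such as s1.

```python
# A minimal sketch: enumerate nontrivial solid patterns and keep the maximal ones.
from collections import defaultdict

def solid_patterns(s):
    """Location lists (1-based leftmost positions) of all nontrivial solid patterns."""
    loc = defaultdict(list)
    n = len(s)
    for i in range(n):
        for j in range(i + 2, n + 1):        # |p| >= 2
            loc[s[i:j]].append(i + 1)
    return {p: L for p, L in loc.items() if len(L) >= 2}

def maximal_solid_patterns(s):
    pats = solid_patterns(s)
    maximal = {}
    for p, L in pats.items():
        # p is not maximal if some one-character extension keeps the same support.
        left  = any(len(pats.get(c + p, [])) == len(L) for c in set(s))
        right = any(len(pats.get(p + c, [])) == len(L) for c in set(s))
        if not left and not right:
            maximal[p] = L
    return maximal

s1 = "abcdabcdabcab"
print(maximal_solid_patterns(s1))
# three maximal patterns: abcdabc {1,5}, abc {1,5,9}, ab {1,5,9,12}
```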

How large can |Pmaximal (s)| be for a given s?

6.3.2 Counting the maximal patterns


We show that the number of maximal solid patterns in a string of size n
can be no more than n. This is believed to be a ‘folklore’ theorem.1 This
section substantiates the folklore using suffix trees. Given s, a suffix tree is
the compact trie of all suffixes of s. This is explained through a concrete
example below.
Given a string s of size n, let

$ ∉ Σ.

We terminate s with $ as
s$,

1 An alternative proof is from [BBE+ 85]: It descends from the observation that for any two

substrings p1 and p2 of s, if
Lp1 ∩ Lp2 ≠ ∅,
then p1 is a prefix of p2 or vice versa.


FIGURE 6.2: Suffix tree T (s1 ) where s1 = a b c d a b c d a b c a b $.


All the suffixes sufi , 1 ≤ i ≤ 13 can be read off the labels of the unique path
from the root node to the leaf node marked by the integer i.

for reasons that will soon become obvious. This is best explained with a
concrete example. A suffix of s, sufi (1 ≤ i ≤ n), is defined as

sufi = s[i..n].

Continuing the working example, s1 is written as

s1 = a b c d a b c d a b c a b $.

Then the suffixes of s1 are as follows.

    suf1  = abcdabcdabcab$
    suf2  = bcdabcdabcab$
    suf3  = cdabcdabcab$
    suf4  = dabcdabcab$
    suf5  = abcdabcab$
    suf6  = bcdabcab$
    suf7  = cdabcab$
    suf8  = dabcab$
    suf9  = abcab$
    suf10 = bcab$
    suf11 = cab$
    suf12 = ab$
    suf13 = b$

The suffix tree of s1 , written as

T (s1 ),

is a compact trie of all the 13 suffixes and is shown in Figure 6.2. The direction
of each edge is downwards. It has three kinds of nodes.
1. The square node is the root node. It has no incoming edges.
2. The internal nodes are shown as solid circles. Each has a unique incom-
ing edge and multiple outgoing edges.
3. The leaf nodes are shown as hollow circles. A leaf node has no outgoing
edges.
The edges of T (s) are labeled by nonempty substrings of s and the tree satisfies
the following properties.
1. The tree has exactly n leaf nodes, labeled bijectively by the integers

1, 2, . . . , n.

2. Each internal node has at least two outgoing edges.


3. No two outgoing edges of a node can be labeled with strings that start
with the same character.
4. Suffix sufi is obtained by traversing the unique path from the root node
to the leaf node labeled with i and concatenating the edge labels of this
path.
T (s) is called the suffix tree of s.
A suffix tree has various elegant properties, but we focus only on those that
help prove the folklore.

1. (Property 1): Let r be a label on an incoming edge on node v. Then


r appears exactly k times in s where k is the number of leaf nodes
reachable from node v.
2. (Property 2): The number of internal nodes, Is , in T (s) satisfies the
following:
Is ≤ n.

The first property follows from the definition of the suffix tree and we have
seen the second property in Chapter 2. We now make the crucial observation
about the suffix tree.

THEOREM 6.1
(Suffix tree theorem) Given s,

p ∈ Pmaximal (s)

can be obtained by reading off the labels of a path from the root node to an
internal node in T (s).

PROOF Given a node v, let

pth(v)

denote the string which is the label on the path from the root node to v.
Each maximal pattern is a prefix of some suffix,

sufi , 1 ≤ i < n,

hence can be read off from the (prefix) of the label of a path from the root
node.
Let p be a maximal motif. If

p = pth(v),

then v cannot be a leaf node since then the motif must end in the terminating
symbol $.
Assume the contrary, that there does not exist any internal node v such
that
p = pth(v).
Hence p must end in the middle of an edge label, say r where the edge is
incident on some internal node v ′ . Let

r = r1 r2 ,

where r1 is a suffix of the maximal pattern p. Let k be the number of leaf


nodes reachable from v ′ . Then by Property 1 of suffix trees, p appears k times
in s. Then the concatenated string

p r2

must also appear k times in s, contradicting the assumption that p is maximal.


Hence for some internal node v,

p = pth(v).

This concludes the proof.

In the example in Figure 6.2, note that each

    p ∈ Pmaximal (s1 )

is such that there exists a node v in T (s1 ) with pth(v) = p.


However, the converse of this theorem is not true. Let

p1 = b c,
p2 = b c d a b c,
p3 = c d a b c,
p4 = d a b c,

and
p1 , p2 , p3 , p4 ∉ Pmaximal (s1 ).
Notice in Figure 6.2 that for each pi , 1 ≤ i ≤ 4, there exists some node vi in
T (s1 ) such that
pth(vi ) = pi .
Thus the pattern

    p = pth(v),

corresponding to an arbitrary internal node v need not be maximal. Since we only need an upper
bound on |Pmaximal (s)|, this is not a concern.

THEOREM 6.2
(Maximal solid pattern theorem) Given a string s of size n, the number
of maximal solid patterns in s is no more than n, i.e.,

|Pmaximal (s)| ≤ n.

This directly follows from Property (2) of suffix trees and Theorem (6.1).

6.4 Rigid Patterns


Unlike the solid pattern, a rigid pattern, p, also has 'dont care' characters,
written as ‘.’. This is also called a dot character or a wild card. Thus p is
defined on
Σ ∪ {‘.’},
also written for convenience as:

Σ + ‘.’

The pattern is called rigid since the length it covers in all its occurrences is the
same (in the next section we study patterns that occupy different lengths).
Next, we define the occurrence of a rigid motif. We first define the following.
Let
σ1 , σ2 ∈ Σ ∪ {'.'}.
Then we define a partial order relation (⪯) as follows.

    σ1 ≺ σ2  ⇔  (σ1 = '.') AND (σ2 ∈ Σ).
    σ1 ⪯ σ2  ⇔  (σ1 ≺ σ2 ) OR (σ1 = σ2 ).

A rigid pattern p occurs at position j in s if for 1 ≤ l ≤ |p |,

p[l] ⪯ s[j + l − 1]

p is said to cover the interval

[j, j + |p | −1]

on s. Also for two strings with

|p |= |q|,

we say
p ⪯ q,

if and only if the following holds for 1 ≤ l ≤ |p |,

p[l] ⪯ q[l].

Thus the occurrence of a rigid motif p in s is simply defined as follows:


There exist some
1 ≤ j ≤ k ≤ n − |p |,
such that
p  s[j..k].
If p and q are of different lengths, then the smaller string is padded with ‘.’
characters to the right, so the padded string is of the same length as the other,
for the comparison. See Exercise 59 for the connection between relation 
between patterns and their location lists.
Note that a nontrivial rigid pattern cannot start or end with a ‘.’ character.
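A minimal Python sketch of this occurrence test, and of building a location list from it, is given below; the helper names are illustrative assumptions.

```python
# A minimal sketch of rigid-pattern matching under the relation defined above:
# a '.' in the pattern matches any character, a solid character matches only itself.
def occurs_at(p, s, j):
    """Does rigid pattern p occur at (1-based) position j of string s?"""
    if j < 1 or j + len(p) - 1 > len(s):
        return False
    return all(c == "." or c == s[j - 1 + l] for l, c in enumerate(p))

def location_list(p, s):
    """L_p: all 1-based leftmost positions where p occurs in s."""
    return [j for j in range(1, len(s) - len(p) + 2) if occurs_at(p, s, j)]

s1 = "abcdabcdabcab"
print(location_list("c.a.c", s1))   # [3, 7]
print(location_list("cdabc", s1))   # [3, 7]
```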
Consider the example of the last section:

s1 = a b c d a b c d a b c a b.

Let P (s1 ) be the set of all nontrivial rigid patterns and we consider a subset,
P ′ (s1 ), as shown below.

    P′(s1) = { c d a b c,   c d a b,   c d a,   c d,
               c . a b c,   c . a b,   c . a,
               c d . b c,   c d . b,
               c d a . c,   c . . b,
               c . . b c,
               c d . . c,
               c . a . c,
               c . . . c }  ⊂  P(s1).
Following the convention that, i ∈ Lp , is the leftmost position (index) in s1


of an occurrence of p, for each p ∈ P ′ (s1 ),

Lp = {3, 7}.

In fact, if q is a nontrivial pattern in s1 with

Lq = {3, 7},

then
q ∈ P ′ (s1 ).

6.4.1 Maximal rigid patterns


Since the rigid patterns can have the '.' character in them, maximality for
rigid patterns is a little more complicated than that of solid patterns. If the

‘.’ character in a p can be replaced by a solid character, it is called saturating


the pattern p. For example
p=a. c. e
can be saturated to p1 or p2 or p3 where
p1 = a b c . e,
p2 = a . c d e,
p3 = a b c d e.
A maximal rigid pattern p must satisfy both the conditions below.
1. (maximal in length) p is such that it cannot be extended to the left or
to the right, without decreasing its support, to get a nontrivial
p′ ≠ p.

2. (maximal in composition) p is such that there exists no other p′ with


p ⪯ p′ and Lp = Lp′ .
In other words, a maximal p cannot be saturated without decreasing its
support.
This is indeed the intuitive definition of maximality for rigid motifs. It can
be shown that this definition is equivalent to the one in Chapter 4 in terms
of its location list.
Consider P ′ (s1 ) of the running example. Here each pattern can be saturated
and the saturated motif can be extended to give
cdabc
as the maximal rigid pattern. Thus it can be shown that the maximal rigid
patterns are:
Pmaximal (s1 ) = {a b c d a b c, a b c, a b}.
It turns out that for s1 the set of solid maximal patterns is the same as the
set of rigid maximal patterns. Consider the following:
s2 = a b d d a c c d a b c a b.
Then the set of maximal rigid patterns is given as
Pmaximal (s2 ) = {a . . d a . c, a b}.
Following the convention that, i ∈ Lp , is the leftmost position (index) in s2
of an occurrence of p,
La . . d a . c = {1, 5},
La b = {1, 9, 12}.
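The two conditions can be tested by brute force on small inputs. The following
Python sketch (illustrative only; the alphabet {a, b, c, d} is assumed and only
single-character extensions are tried, which is enough for these examples)
saturates each ‘.’ and attempts left and right extensions while watching the
support.

# Brute-force check of the two maximality conditions on a small string.
def occurrences(p, s):
    ok = lambda a, b: a == '.' or a == b
    return [j for j in range(len(s) - len(p) + 1)
            if all(ok(p[l], s[j + l]) for l in range(len(p)))]

def is_maximal(p, s, sigma="abcd"):
    L = occurrences(p, s)
    # maximal in composition: no '.' can be replaced by a solid character
    for i, c in enumerate(p):
        if c == '.':
            for x in sigma:
                if len(occurrences(p[:i] + x + p[i + 1:], s)) == len(L):
                    return False
    # maximal in length: no single-character extension keeps the support
    # (a full test would also try extensions such as '. x')
    for x in sigma:
        if len(occurrences(x + p, s)) == len(L) or \
           len(occurrences(p + x, s)) == len(L):
            return False
    return True

s1 = "abcdabcdabcab"
print(is_maximal("abcdabc", s1))   # True
print(is_maximal("c.abc", s1))     # False: the '.' saturates to d with the same support
s2 = "abddaccdabcab"
print(is_maximal("a..da.c", s2))   # True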
p                             size    Lp
|←−−−−−−−− ℓ −−−−−−−−→|
c c c c · · · c c c c          ℓ      {1, ℓ + 2}
  c c c · · · c c c c         ℓ−1     {1, 2, ℓ + 2, ℓ + 3}
    c c · · · c c c c         ℓ−2     {1, 2, 3, ℓ + 2, ℓ + 3, ℓ + 4}
      c · · · c c c c         ℓ−3     {1, 2, 3, 4, ℓ + 2, ℓ + 3, ℓ + 4, ℓ + 5}
        ...                    ...     ...
              c c c c          4      {1, 2, . . . , ℓ − 3, ℓ + 2, ℓ + 3, . . . , 2ℓ − 3}
                c c c          3      {1, 2, . . . , ℓ − 2, ℓ + 2, ℓ + 3, . . . , 2ℓ − 2}
                  c c          2      {1, 2, . . . , ℓ − 1, ℓ + 2, ℓ + 3, . . . , 2ℓ − 1}

FIGURE 6.3: All nontrivial maximal motifs in s3 with no ‘.’ elements.

6.4.2 Enumerating maximal rigid patterns


How large can |Pmaximal (s)| be for a given s?
We have already seen that the number of maximal solid patterns is linear
in the size of the input. However, we demonstrate here that the number of
maximal rigid patterns can be very large (possibly exponential in the size of
the input).
The discussion in this section is not to be construed as a method for enumer-
ation (a general discovery algorithm is presented in Section 7.2), but merely
a means to gain insight into the complexity of interplay of patterns when a
dont care character (the ‘.’ character) is permitted in the pattern.
Construct an input string s3 of length

n = 2ℓ + 1,

consisting of a stretch of ℓ c’s, followed by a single a, followed by ℓ c’s, as
shown below.

s3 = c c c · · · c c c  a  c c c · · · c c c
     |←−−− ℓ −−−→|         |←−−− ℓ −−−→|
In the discussion below, we follow the convention that, i ∈ Lp , is the left-
most position (index) in the input of an occurrence of p.
All nontrivial patterns with no ‘.’ character are shown in Figure 6.3.
Figure 6.4 shows the longest patterns with exactly one ‘.’ character. Each
such pattern is of length (or size) ℓ and if p1i is a pattern with one ‘.’ character at
position i of the pattern, then

Lp1i = {1, ℓ − (i − 2)}.

The number of such patterns is

ℓ − 2.
p                             Lp           i
|←−−−−−−−− ℓ −−−−−−−−→|
c . c c · · · c c c c         {1, ℓ}       2
c c . c · · · c c c c         {1, ℓ − 1}   3
c c c . · · · c c c c         {1, ℓ − 2}   4
        ...                    ...         ...
c c c c · · · . c c c         {1, 5}       ℓ − 3
c c c c · · · c . c c         {1, 4}       ℓ − 2
c c c c · · · c c . c         {1, 3}       ℓ − 1

FIGURE 6.4: The longest patterns with exactly one ‘.’ character at posi-
tion i in the pattern in s3 . Each is of length ℓ and is maximal in composition
but not in length.

It is interesting to note that the longest pattern with exactly one ‘.’ character
is not maximal. Each pattern listed above is maximal in composition but not
maximal in length. In fact, there is no maximal pattern with exactly one ‘.’
character in s3 .
Figure 6.5 shows the maximal patterns with exactly two ‘.’ elements at
positions i and j of the pattern. The maximal pattern can be written as p2i ,
which has a ‘.’ character at position i and at position ℓ + 1 of the pattern. p2i
is of length ℓ + i, and
Lp2i = {1, ℓ − (i − 2)}.
The number of such maximal patterns is ℓ − 1.
Figure 6.7 shows a few examples of patterns with more than two ‘.’ elements
and each of these can be constructed from the patterns of Figure 6.5. It can
be verified that the number of such maximal patterns with j + 1 ‘.’ elements
is the binomial coefficient

    C(ℓ − 1, j).

Thus, including the maximal patterns with no ‘.’ elements and exactly two ‘.’
elements, the total number of maximal patterns is given as:

    2(ℓ − 1) + Σ_{j=3}^{ℓ−3} C(ℓ − 1, j).

Note that
ℓ ≈ n/2,
thus the number of maximal patterns is
O(2^n).
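The growth can also be observed empirically. The following brute-force Python
sketch (exponential time, for illustration only; quorum 2 is assumed) enumerates
the candidate patterns over {c, ‘.’} and keeps the ones that are maximal in
composition and in length for a small ℓ.

# Brute-force enumeration of maximal rigid patterns of s3 = c^l a c^l.
from itertools import product

def occ(p, s):
    return [j for j in range(len(s) - len(p) + 1)
            if all(c == '.' or c == s[j + k] for k, c in enumerate(p))]

def is_maximal(p, s):
    L = occ(p, s)
    if len(L) < 2:                       # quorum 2 assumed
        return False
    # maximal in composition: every '.' must see at least two different characters
    for k, c in enumerate(p):
        if c == '.' and len({s[j + k] for j in L}) == 1:
            return False
    # maximal in length: no offset to the right (or left) of every occurrence
    # shows the same character while staying inside s
    for off in range(1, len(s)):
        right = {s[j + len(p) - 1 + off] for j in L if j + len(p) - 1 + off < len(s)}
        left = {s[j - off] for j in L if j - off >= 0}
        if len(right) == 1 and all(j + len(p) - 1 + off < len(s) for j in L):
            return False
        if len(left) == 1 and all(j - off >= 0 for j in L):
            return False
    return True

l = 5
s3 = 'c' * l + 'a' + 'c' * l
maximal = set()
for length in range(2, len(s3) + 1):
    for inner in product('c.', repeat=length - 2):   # the single a cannot occur twice
        p = 'c' + ''.join(inner) + 'c'               # a pattern cannot start/end with '.'
        if is_maximal(p, s3):
            maximal.add(p)
print(len(maximal), "maximal rigid patterns for l =", l)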
p                                                   Lp           i, j
|←−−−− ℓ+2 −−−−→|
c . c c · · · c . c                                 {1, ℓ}       2, ℓ + 1

|←−−−− ℓ+3 −−−−→|
c c . c · · · c . c c                               {1, ℓ − 1}   3, ℓ + 1

|←−−−− ℓ+4 −−−−→|
c c c . c · · · c . c c c                           {1, ℓ − 2}   4, ℓ + 1
        ...                                          ...          ...
|←−−−− 2ℓ−2 −−−−→|
c c c · · · c . c c c . c · · · c c c               {1, 4}       ℓ − 2, ℓ + 1

|←−−−− 2ℓ−1 −−−−→|
c c c c · · · c c . c . c c · · · c c c c           {1, 3}       ℓ − 1, ℓ + 1

|←−−−− 2ℓ −−−−→|
c c c c · · · c c c . . c c c · · · c c c c         {1, 2}       ℓ, ℓ + 1

FIGURE 6.5: All maximal patterns in s3 with two ‘.’ elements at positions
i and j of the pattern. Note that the length of each pattern is > ℓ + 1.

FIGURE 6.6: The ℓ+1 collection of maximal patterns with two ‘.’ elements
(shown as •) on s3 have been stacked in two different ways (left flushed and
right flushed) to reveal the ‘pattern’ of their arrangement within the maximal
patterns.

p                                              Lp                 i, j, k
|←−−−− ℓ+2 −−−−→|
c . c c · · · c . . c                          {1, ℓ − 1, ℓ}      2, ℓ, ℓ + 1
c . . c · · · c c . c                          {1, 2, ℓ}          2, 3, ℓ + 1
c . c c · · · . c . c                          {1, ℓ − 2, ℓ}      2, ℓ − 1, ℓ + 1
c . c . · · · c c . c                          {1, 3, ℓ}          2, 4, ℓ + 1
c . c c . · · · c c . c                        {1, 4, ℓ}          2, 5, ℓ + 1

|←−−−− ℓ+3 −−−−→|
c c . c c · · · c . . c c                      {1, ℓ − 2, ℓ − 1}  3, ℓ, ℓ + 1
c c . c . · · · c c . c c                      {1, 3, ℓ − 1}      3, 5, ℓ + 1
c c . c c . · · · c . c c                      {1, 4, ℓ − 1}      3, 6, ℓ + 1

|←−−−− ℓ+4 −−−−→|
c c c . c c c . · · · . c c c                  {1, 5, ℓ − 2}      4, 8, ℓ + 1

|←−−−− 2ℓ−3 −−−−→|
c c c c · · · . c c . . · · · c c c c          {1, 4, 5}          ℓ − 3, ℓ, ℓ + 1
c c c c · · · . c . c . · · · c c c c          {1, 3, 5}          ℓ − 3, ℓ − 1, ℓ + 1

|←−−−− 2ℓ−2 −−−−→|
c c c c c · · · . c . . · · · c c c c c        {1, 3, 4}          ℓ − 2, ℓ, ℓ + 1

FIGURE 6.7: Some maximal patterns in s3 with three ‘.’ elements each
at positions i, j and k of the pattern.

6.4.3 Density-constrained patterns


Is it possible that the number of maximal patterns is high because the patterns
are allowed to have a large stretch of ‘.’ elements? See some of the maximal
patterns in s3 below.
c . . ... . . c
c c . ... . . c
c . . ... . c c
c c . ... . c c
|←− ℓ −→|
We define a version of rigid patterns with density constraints. This dictates
the ratio of the ‘.’ character to the solid characters in the pattern. One way of
specifying the density is to impose a bound, d, on the number of consecutive
‘.’ elements in a pattern. This is to be interpreted as follows: In a pattern,
no two solid characters can have more than d consecutive ‘.’ elements.
Does the density-constraint help in cutting down the number of maximal
(restricted) rigid patterns?
Let d = 1, i.e., two solid characters can have at most one ‘.’ character
between them. For example, the patterns that meet the density constraint
are marked with √ (the rest are marked with ×) below.

p1 = c . . c c c . c    ×
p2 = c . c . c . . c    ×
p3 = c . . . . . . c    ×
p4 = c . c . c . c c    √
p5 = c c c c c c c c    √
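The check itself is a one-pass scan. A small Python sketch (illustrative; the
function name is ours): a pattern meets the constraint if no run of consecutive
‘.’ elements between solid characters exceeds d.

# Density check: no more than d consecutive '.' characters in the pattern.
def meets_density(p, d):
    run = 0
    for c in p:
        if c == '.':
            run += 1
            if run > d:
                return False
        else:
            run = 0
    return True

for p in ["c..ccc.c", "c.c.c..c", "c......c", "c.c.c.cc", "cccccccc"]:
    print(p, meets_density(p, d=1))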

However, we construct an example to show that even under this stringent


density requirement, where d takes the smallest possible value of 1, the number
of maximal density-constrained rigid patterns can be very large. Recall s3 of
length 2ℓ + 1, defined in the last section:

s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|

Construct s4 of length n = 4ℓ + 1, based on s3 , as follows:

s4 = c a c a c a · · · c a c a c a g c a c a c a · · · c a c a c a
|←− 2ℓ −→| |←− 2ℓ −→|

Note that each


c
of s3 is replaced with
ca

in s4 (and a is replaced with g). Let p be a pattern on s3 whose alphabet set


is:
Σs3 = {a, c}.
p′ is constructed from p by replacing each element

x ∈ Σ + ‘.′

with
x a.
For example, p1 and p2 are maximal motifs in s3 below. Notice that p′1 and
p′2 meet the density constraint (d = 1).

|←− ℓ+2 −→|


p1 =
c . . c c ··· c c c . c
|←− 2ℓ + 4 −→|
p′1 =
c a . a . a c a c a ··· c a c a c a . a c a
c c c c ··· c c c c . . . . . ... . . . . . c
p2 =
|←− ℓ − 1 −→| |←− ℓ −→| 1
c a c a c a ··· c a c a c a . a . a . a ··· . a . a . a c a
p′2 =
|←− 2ℓ − 2 −→| |←− 2ℓ −→| 2

Next, we make the following claims.

1. If p is a pattern in s3 , then p′ is a pattern in s4 .

2. If p is maximal in s3 , then p′ is maximal in s4 .

The proof of the claim is straightforward and we leave that as an exercise for
the reader (Exercise 54). Thus the number of maximal density-constrained
(d = 1) rigid patterns in s4 is also

O(2n ).

6.4.4 Quorum-constrained patterns


For any pattern p,
|Lp | ≥ 2.
Here we define a version of rigid patterns with quorum constraint. This dic-
tates that for a specified quorum k(< n), the following must hold:

|Lp | ≥ k.

Although, this filters out patterns that occur less than k times, it is not
sufficient to bring the count of the patterns down. We next show that even

for the quorum-constrained rigid patterns, the number of maximal motifs can
be very large.
Consider the input string s3 of Section 6.4.2. The number of quorum-con-
strained maximal motifs can be verified to be at least:

    (ℓ − k + 1) + Σ_{j=k}^{ℓ−3} C(ℓ − 1, j).

The patterns can be constrained to meet both the density and quorum
requirements, yet the number of maximal patterns could be very large. This
is demonstrated by calculating the number of patterns on s4 . We leave this as
an exercise for the reader.

6.4.5 Large-|Σ| input


Note that the s3 ’s alphabet set is

{c, a}

and s4 ’s alphabet is
{c, a, g}.
Both the sets are fairly small. Is it possible that if the alphabet is large, say,

|Σ| = O(n),

then the number of maximal patterns is limited? Let

ℓ = √n,

and
Σ = {e, 0, 1, 2, . . . , ℓ−1, ℓ}.
For convenience, denote
ℓi = ℓ − i.
Next we construct s5 and for convenience display the string in ℓ rows as
follows:

s5 =  0 1 2 3 · · · ℓ3 ℓ2 ℓ1 ℓ
      0 e 2 3 · · · ℓ3 ℓ2 ℓ1 ℓ
      0 1 e 3 · · · ℓ3 ℓ2 ℓ1 ℓ
      0 1 2 e · · · ℓ3 ℓ2 ℓ1 ℓ
            ...
      0 1 2 3 · · · e  ℓ2 ℓ1 ℓ
      0 1 2 3 · · · ℓ3 e  ℓ1 ℓ
      0 1 2 3 · · · ℓ3 ℓ2 e  ℓ

We make the following claim about s5 .


Let

D ⊆ {1, 2, . . . , ℓ1 } and
q = 0 1 2 3 · · · ℓ3 ℓ2 ℓ1 ℓ

For each j ∈ D, obtain q(D) of length ℓ + 1 as follows: replace the element j
in q with the dont care character ‘.’. For example,

D1 = {1, 2},         q(D1 ) = 0 . . 3 · · · ℓ3 ℓ2 ℓ1 ℓ
D2 = {2, ℓ2 , ℓ1 },  q(D2 ) = 0 1 . 3 · · · ℓ3 . . ℓ

Next, we then make the following claims:


1. For nonempty sets
D1 ≠ D2 ,
clearly the following holds:

q(D1 ) ≠ q(D2 ).

2. For each nonempty D, there exists a unique maximal motif, say pD , in
s5 which has a prefix q(D). In other words,

pD [1 . . . 1 + ℓ] = q(D).

Following the convention that i ∈ Lp is the leftmost position (index)
in s5 of an occurrence of p, we have

(a)
1 ∈ LpD .

(b) For each j ∈ D, the following position (index) is in LpD :

(ℓ + 1)(j − 1) + (j + 1) ∈ LpD .

(c)
|LpD | = |D| + 1.

3. The number of distinct such D’s is

O(2^(ℓ−1)).

To understand these claims, consider the case where ℓ = 5. Then s′5 (from
s5 ) is constructed as below:

s′5 = 0 1 2 3 4 5 0 e 2 3 4 5 0 1 e 3 4 5 0 1 2 e 4 5 0 1 2 3 e 5
↑ ↑ ↑ ↑ ↑
i= 1 7 13 19 25
Then all possible D sets and the corresponding q(D) are shown below:

|D| = 1              |D| = 2              |D| = 3                  |D| = 4

 D     q(D)           D      q(D)          D          q(D)          D             q(D)
{1}   0 . 2 3 4 5    {1, 2}  0 . . 3 4 5  {1, 2, 3}  0 . . . 4 5   {1, 2, 3, 4}  0 . . . . 5
{2}   0 1 . 3 4 5    {1, 3}  0 . 2 . 4 5  {1, 2, 4}  0 . . 3 . 5
{3}   0 1 2 . 4 5    {1, 4}  0 . 2 3 . 5  {1, 3, 4}  0 . 2 . . 5
{4}   0 1 2 3 . 5    {2, 3}  0 1 . . 4 5  {2, 3, 4}  0 1 . . . 5
                     {2, 4}  0 1 . 3 . 5
                     {3, 4}  0 1 2 . . 5

The maximal pattern pD and its location list LpD for four cases are shown
below.

q(D)           pD                          LpD
0 1 2 . 4 5    0 1 2 . 4 5 0 . 2 3 . 5     {1, 19}
0 1 . . 4 5    0 1 . . 4 5 0 . 2 . . 5     {1, 13, 19}
0 . . 3 . 5    0 . . 3 . 5                 {1, 7, 13, 25}
0 . . . . 5    0 . . . . 5                 {1, 7, 13, 19, 25}

We leave the proof of the claims for a general ℓ as an exercise for the reader
(Exercise 55).

Recall that ℓ = √n. Thus the number of maximal rigid patterns in s5 is

    O(2^√n).

6.4.6 Irredundant patterns


Given an input s, let Pmaximal (s) be the set of all maximal patterns in s.
A motif p ∈ Pmaximal (s) is redundant if

p = p1 ⊗ p2 ⊗ . . . ⊗ pl , for some alignment (i1 , i2 , . . . , il ),

where each
pi ∈ Pmaximal (s), pi ≠ p, i = 1, 2, . . . , l,
and the support of p is obtained from the supports of the pi ’s, i.e.,

Lp = Lp1 ∪ Lp2 ∪ . . . ∪ Lpl .

In other words, if all the information about a maximal pattern p is contained


in some other l maximal patterns, then since p has nothing new to offer, it is
a redundant pattern. Also, if need be, p can be deduced (constructed) from
these l patterns. Hence the set of irredundant patterns is also called a basis.
Let Pbasis (s) be the set of all irredundant rigid patterns in s.

However, there are some details hidden in the notation. We bring these out
in the following concrete example. For some input s, let

p0 , p1 , p2 , p3 , p4 ∈ Pmaximal (s).

and the following holds:

p1      G T T . G A
p2  G G G T G G A C C C
p3      G T . G A C C
p4      G T T G A C
p0  . . . T . G A . . .

Thus p0 is obtained by aligning the pi ’s as shown. Incorporating the alignment


information,
p0 = p1 ⊗ p2 ⊗ p3 ⊗ p4
is annotated as
p0 = p1 (3) ⊗ p2 (4) ⊗ p3 (2) ⊗ p4 (2).
Thus pi (j), is to be interpreted as the jth position of pi corresponding to the
leftmost or position 1 of p0 . Let

L + j = {i + j | i ∈ L}.

Then, following the convention that, i ∈ Lp , is the leftmost position (index)


in s of an occurrence of p, we have

    Lp0 = (Lp1 + 3) ∪ (Lp2 + 4) ∪ (Lp3 + 2) ∪ (Lp4 + 2).

Next, we state the central theorem of this section [AP04]. We begin with a
definition that the theorem uses. Given an input s of length n, for 1 < i < n,
let ari be defined as follows:2

ari = s ⊗ s for alignment (1, i).

THEOREM 6.3
(Basis theorem)[AP04] Given an input s of length n,

Pbasis (s) = {ari | 1 < i < n}.

Thus |Pbasis (s)| < n.

2 ari is also called the ith autocorrelation.

The proof is straightforward using the vocabulary developed so far and we


leave that as an exercise for the reader: see Exercise 61 for the roadmap of
the proof. A stronger result (of which this theorem is a corollary) is discussed
in [AP04].
For a concrete example, see Exercise 62 for the basis of s3 of Section 6.4.2.
Note that the size of the basis is
2ℓ − 2,
whereas the number of maximal patterns was shown to be
O(2^n).
Although the size of the basis is linear in the size of the input, it is important
to bear in mind that extracting all the maximal motifs may still involve more
work.
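The autocorrelations themselves are easy to compute directly. The following
Python sketch (an illustration under the conventions above; trimming the
flanking ‘.’ characters is our assumption, since a nontrivial pattern cannot
start or end with ‘.’) computes ari = s ⊗ s for alignment (1, i) and collects
the distinct results for the running example s3.

# ar_i: character-wise agreement of s with itself at alignment (1, i).
def meet(s, i):
    overlap = len(s) - (i - 1)
    raw = ''.join(a if a == b else '.' for a, b in zip(s[:overlap], s[i - 1:]))
    return raw.strip('.')              # drop flanking '.' characters

s3 = 'ccccc' + 'a' + 'ccccc'           # l = 5
basis = {meet(s3, i) for i in range(2, len(s3)) if meet(s3, i)}
for p in sorted(basis, key=len):
    print(p)
print("basis size:", len(basis), "<", len(s3))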

6.4.6.1 Density-constrained patterns


If the patterns are restricted to satisfy the density constraints, then what
can be said about the size of the basis for this restricted class of rigid patterns?
Let
p1 ⪯ p2 .
Since p2 is more specific (it has at least as many solid characters as p1 ), if p1
meets the density requirement, then p2 also meets the density requirement.
Hence if a pattern p is redundant w.r.t. maximal motifs
p1 , p2 , . . . , pl ,
then it is not possible that p meets the density requirement while one or more
of the pi , 1 ≤ i ≤ l, does not.
Although this restriction filters out patterns that do not meet the density
requirements, the basis for this restricted class actually gets larger.

THEOREM 6.4
(Density-constrained basis theorem) Given s of length n, let
P^(d)_basis (s)
denote the basis for the collection of rigid patterns that satisfy some density
constraint d > 0. Then

    |P^(d)_basis (s)| < n^2.

An informal argument is as follows. Each ari , 1 < i < n, may get frag-
mented at regions where the density constraint is not met. For example, let
density constraint d = 2 and consider
ari = c a c . . g . . . c c . c . c a t . . . c t.

Then ari has three fragments as follows:

ari1 = c a c . . g
ari2 = c c . c . c a t
ari3 = c t

The number of such fragments for each ari is no more than n. Hence the
size of the basis is
O(n^2).

6.4.6.2 Quorum-constrained patterns


If the patterns satisfy the quorum constraint k ≥ 2, then what can be said
about the size of the basis for this restricted class of rigid patterns?
Although, this restriction filters out patterns that occur less than k times,
the basis for this restricted class actually gets larger.

THEOREM 6.5
(Quorum-constrained basis theorem) [AP04, PCGS05] Given s, let

P^(k)_basis (s)

denote the basis for the collection of rigid patterns that satisfy quorum k (> 1).
Then:

    P^(k)_basis (s) = { s ⊗ s ⊗ . . . ⊗ s, for alignment (1, i2 , i3 , . . . , ik ) |
                        1 < i2 < . . . < ik < n }.

Thus

    |P^(k)_basis (s)| < (n − 1)^(k−1).

This follows from the proof of Theorem (6.3). However, to show that such a
bound is actually attained, consider the concrete example of s3 of Section 6.4.2
and quorum k = 3. See Figure 6.7 for some maximal patterns p with

|Lp | = 3.

It can be verified that the number of such maximal patterns, in s3 , is

    C(ℓ − 2, 2).

6.5 Extensible Patterns


Consider two occurrences of a pattern in the following string.
s6 = C G G T C G T T C G C A T A G
Note that the length of the cover of the two occurrences differ, yet there is
some commonality (shown in bold) in the two that is captured as follows:
p = C G - T . G
where the dash symbol ‘-’ is used to denote the variable gap. Its two occur-
rences, with the variable gap replaced by a fixed number of ‘.’ elements, are
C G . T . G and C G . . T . G.
These two rigid patterns are also called realizations. Allowing for spacers in
a pattern is what makes it extensible. The dash symbol ‘-’ represents the
extensible wild card and
‘-’ ∉ Σ.
Thus an extensible pattern is defined on
Σ+‘.’+‘-’
The density constraint d denotes the maximum number of consecutive dots
allowed in a string or the maximum size of the gap. We use the extensible
wild card denoted by the dash symbol ‘-’.
Given an extensible pattern (string) p, a rigid pattern (string) p′ is a re-
alization of p if each extensible gap is replaced by the exact number of dots.
The collection of all such rigid realizations of p is denoted by
R(p).
Thus in this example,
R(C G - T . G) = {C G . T . G,
C G . . T . G}.
As discussed earlier, a rigid string p occurs at position i on s if
p[j] ⪯ s[i + j − 1], for 1 ≤ j ≤ |p|.
An extensible string p occurs at position l in s if there exists a realization p′ of
p that occurs at l.
Note that an extensible string p could possibly occur a multiple number of
times at a location on a sequence s. See Exercise 64 for an illustration. In the
rest of the discussion our interest is in only the very first occurrence from the
left at a location. This multiplicity of occurrence increases the complexity of
the algorithm over that of rigid motifs in the discovery process discussed in
Section 7.2.
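The realizations of an extensible pattern are easy to enumerate. The following
Python sketch (illustrative only; here each ‘-’ is taken to stand for one up to
d ‘.’ characters, matching the two realizations shown above) expands the gaps
in all possible ways.

# Enumerate the rigid realizations R(p) of an extensible pattern p.
from itertools import product

def realizations(p, d):
    parts = p.split('-')
    out = []
    for gaps in product(range(1, d + 1), repeat=len(parts) - 1):
        q = parts[0]
        for g, part in zip(gaps, parts[1:]):
            q += '.' * g + part
        out.append(q)
    return out

print(realizations("CG-T.G", d=2))   # ['CG.T.G', 'CG..T.G']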

6.5.1 Maximal extensible patterns


See Exercises 65 and 66 for examples of trivial extensible patterns in the
absence of any density requirement. Thus it is very essential for a meaningful
extensible pattern to have a density constraint, d. This ensures that the
variable gap is no larger than d positions or characters.
A maximal extensible motif p must satisfy all the three conditions below.
1. (maximal in length) p is such that it cannot be extended to the left or
to the right, without decreasing its support.
2. (maximal in composition) p is such that no ‘.’ character can be replaced
by a solid character, without decreasing its support.
3. (maximal in extension) p is such that no extensible gap character of p
can be replaced by a fixed length substring (without extensible gaps),
without decreasing its support. In other words, no extensible gap can
be replaced by a fixed length substring (including the ‘.’ character) that
appears in all the locations in Lp .
An extensible motif that is maximal in length, in composition and in extension
is maximal.

Irredundant patterns. Note that the notion of redundancy stems from


the co-occurrence of distinct rigid or solid patterns in the same locations (or
segments) in the input. Thus if distinct patterns p1 , p2 , . . . , pl co-occur, we
define the redundant motif
p = p1 ⊗ p2 ⊗ . . . ⊗ pl ,
for some alignment.
This is straightforward and we leave that as an exercise for the reader
(Exercise 67).

6.6 Generalizations
Here we discuss two simple and straightforward generalizations of string
patterns: one where an element of the input is replaced by a set of elements
(called homologous sets) and the second where the input is a sequence of real
values.

6.6.1 Homologous sets


Consider the following input
s6 = [G L T] A T L [G L] A T [A T] G.

Note that
Σ = {A, L, G, T}.
The input is interpreted as follows.

1. The first position is either G or L or T; the fifth position is either G
or L; similarly, the eighth position is either A or T.

2. The second position is A; the third position is T and so on.

Thus each element, s[j], is a set for 1 ≤ j ≤ n. Usually, only a certain subset of
elements can appear at a position and they are called homologous characters.
For example, some homologous (groups) amino acids are shown below:

[L I V M]
[L I]
[F Y W]
[A S G]
[A T D]

The partial order, ⪯, on sets is defined as follows. For sets x, y,

x ⪯ y ⇔ (x ⊆ y),

and
‘.’ ⪯ x.
Following the convention that i ∈ Lp is the leftmost position (index) in s6
of an occurrence of p, consider the solid motifs

p1 = G A T,            with Lp1 = {1, 5},
p2 = [G L] A T,        with Lp2 = {1, 5},

and the rigid motifs

p3 = G A T . G,        with Lp3 = {1, 5},
p4 = [G L] A T . G,    with Lp4 = {1, 5}.

Note that
p1 , p2 , p3 ⪯ p4 ,
with
Lp1 = Lp2 = Lp3 = Lp4 .
Thus pi is nonmaximal w.r.t. p4 , for i = 1, 2, 3.

6.6.2 Sequence on reals


Consider a sequence of real numbers. Is it possible to rename each distinct
real number as an element of a discrete alphabet Σ?
Let sr be the sequence of reals, and let sd be the corresponding sequence
with each distinct real number renamed as shown below:
sr = 0.7 3.6 2.2 0.75 2.1 2.2 0.80 6.1 2.2
sd = a b c d e c f g c
It is possible that values 0.75 and 0.80 may be considered fairly close but they
have been assigned two different characters d and f . However, if we decide
to assign any two values that differ by 0.05 the same character then consider
the following scenario:
s′r = 0.75 0.80 0.85 0.90 0.95 1.00
s′d = a a a a a a
Clearly, 0.75 is not the same as 1.0 but both are assigned a by this scheme. We
discuss a systematic method below to convert a sequence of reals to a string
on a finite alphabet so that there is a one-to-one correspondence between the
patterns in one to the other.
Two real values may be deemed equal if they are sufficiently close to each
other. This is usually specified using a parameter δ ≥ 0. Two real numbers x
and y are equal (written as x =r y)
x =r y ⇔ |x − y| ≤ δ.
To avoid confusion with the decimal point, in this section we denote the dont
care character ‘.’ by •. The partial order relation () is defined as follows.
For reals x, y,
x = y ⇔ (x =r y),
and
•  x.
Thus the occurrence of a pattern on a real sequence is defined as in Sec-
tion 6.4.1. Let
δ = 0.12
Following the convention that, i ∈ Lp , is the leftmost position (index) in sr
of an occurrence of p, consider the patterns that have location list
{1, 4, 7}.
p1 = 0.71 • 2.2,
p2 = 0.72 • 2.2,
p3 = 0.73 • 2.2,
p4 = 0.77 • 2.2,
p5 = 0.78 • 2.2,
..
.

In fact there are uncountably infinite patterns with this location list. To
circumvent this problem, we allow the patterns to draw their alphabets not
from real numbers but from closed real intervals. For example, in this case
the first character is replaced by the interval
(0.69, 0.81)
and the unique pattern corresponding to this location list is:
p = (0.69, 0.81) • 2.2, Lp = {1, 4, 7}.
The partial order on intervals is defined naturally as follows (x1 , x2 ,
y1 , y2 are reals):
((x1 , x2 ) ⪯ (y1 , y2 )) ⇔ ((x =r y) for each x1 ≤ x ≤ x2 and y1 ≤ y ≤ y2 ).
Thus for i = 1, . . . , 5,
pi ⪯ p and Lp = Lpi .
In other words, each pi is redundant (or nonmaximal) w.r.t. p.

6.6.2.1 Mapping to a discrete instance


The problem of reporting uncountably infinite patterns on a sequence of
reals is solved by an irredundant (or maximal) pattern defined on intervals of
reals. In this section, we map this problem to a problem on discrete characters.
This is explained through a concrete example. Consider a sequence of six
real numbers
7 8 10 4 1 6.
They are sorted as
1 < 4 < 6 < 7 < 8 < 10 .
and their position on the real line is shown on the bottom horizontal axis of
Figure 6.8. Let
δ = 3,
and the size of δ is shown by the length of the (bold) horizontal bar shown
for each number. Each unique interval intersection is labeled as
a, b, c, d, e, f, g, h
as shown. Each real number x can now be written as the set of alphabet
characters that span

    ( x − δ/2 , x + δ/2 ).
Thus the given problem has been reduced to an instance of a problem defined
on a sequence of multi-sets (homologous sets). The fact that a solution to
the former problem can be computed from a solution to the latter is straight-
forward to see and we leave that as an exercise for the reader (Exercise 57).
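The construction can be sketched in a few lines of Python (an illustration of
the idea, not an algorithm from the text): the windows (x − δ/2, x + δ/2) are
intersected, the covered elementary intervals are labeled a, b, c, . . ., and each
real is rewritten as the set of labels its window spans.

# Map a sequence of reals to homologous sets over a finite alphabet.
def discretize(values, delta):
    pts = sorted(set(values))
    ends = sorted({x - delta / 2 for x in pts} | {x + delta / 2 for x in pts})
    # elementary intervals covered by at least one window
    elementary = [(lo, hi) for lo, hi in zip(ends, ends[1:])
                  if any(x - delta / 2 <= lo and hi <= x + delta / 2 for x in pts)]
    intervals = [(chr(ord('a') + k), lo, hi) for k, (lo, hi) in enumerate(elementary)]
    mapping = {x: [name for name, lo, hi in intervals
                   if x - delta / 2 <= lo and hi <= x + delta / 2]
               for x in pts}
    return mapping, intervals

mapping, intervals = discretize([7, 8, 10, 4, 1, 6], delta=3)
for x in sorted(mapping):
    print(x, "<->", mapping[x])   # e.g. 4 <-> ['b', 'c'], as in Figure 6.8 (2b)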

(1)  s = 7 8 10 4 1 6

(2a) δ = 3: each number is covered by a (bold) bar of length δ on the real
     line; the unique interval intersections are labeled a, b, . . . , h.
(3a) δ = 2: the same construction with shorter bars gives the labels a, . . . , g.

(2b) δ = 3:                     (3b) δ = 2:
      1 ↔ {a}                         1 ↔ {a}
      4 ↔ {b, c}                      4 ↔ {b}
      6 ↔ {c, d, e}                   6 ↔ {c, d}
      7 ↔ {d, e, f}                   7 ↔ {d, e}
      8 ↔ {e, f, g}                   8 ↔ {e, f}
     10 ↔ {g, h}                     10 ↔ {g}

(2c) s3 = [d e f] [e f g] [g h] [b c] a [c d e]
(3c) s2 = [d e] [e f] g b a [c d]

(2d) δ = 3:                     (3d) δ = 2:
      a ↔ (−0.5, 2.5)                 a ↔ (0, 2)
      b ↔ (2.5, 4.5)                  b ↔ (3, 5)
      c ↔ (4.5, 5.5)                  c ↔ (5, 6)
      d ↔ (5.5, 6.5)                  d ↔ (6, 7)
      e ↔ (6.5, 7.5)                  e ↔ (7, 8)
      f ↔ (7.5, 8.5)                  f ↔ (8, 9)
      g ↔ (8.5, 9.5)                  g ↔ (9, 11)
      h ↔ (9.5, 11.5)

FIGURE 6.8: The input real sequence s is shown in (1). For δ = 3: (2a)-
(2b) show the mapping of the reals on the real line to the discrete alphabet.
(2c) shows the string sδ on the discrete alphabet. (2d) shows the mappings
of the discrete alphabet to real (closed) intervals. For δ = 2, the same steps
are shown in (3a)-(3d).

6.7 Exercises
Exercise 46 Consider the solid pattern definition of Section 6.3.

1. Show that the maximality of a solid pattern as defined in this section is


equivalent to the definition of maximality of Chapter 4.

2. Show that for solid patterns so defined, for a given s,

Pmaximal (s) = Pbasis (s).

Hint: 1. Use proof by contradiction. 2. What is p = p1 ⊕ p2 for solid


patterns?

Exercise 47 Is it possible to have a string s, with

|s| > 1,

such that the root node in its suffix tree has exactly one child?

Hint: Consider the suffix tree of s = a a a $; every internal node, including
the root, has an a-child and a $-leaf.

Exercise 48 1. To construct a suffix tree, s is terminated with a meta


symbol $. What purpose does this serve?

2. For each node v in the suffix tree T (s), characterize the set {pth(v)}.
Note that
Pmaximal (s) ⊂ {pth(v)}.

Hint: 1. What if, sufi = s[j..k] holds for some 1 ≤ j < k ≤ n? 2. Is it the
collection of all nonmaximal patterns?

Exercise 49 In Section 6.3, we saw that the number of maximal solid pat-
terns, Pmaximal (s), in s of length n satisfies

|Pmaximal (s)| < n.

Show that this bound is tight, i.e, construct an example s′ where

|Pmaximal (s′ )| = n − 1.

Exercise 50 Consider the rigid pattern definition of Section 6.4. Show that
the maximality of a rigid pattern as defined in this section is equivalent to the
definition of maximality of Chapter 4.

Hint: Use proof by contradiction.

Exercise 51 Let
s = a b c d a b c d a b c a b.
We follow the convention that, i ∈ Lp , is the leftmost position (index) in s1
of an occurrence of p. Enumerate all the rigid patterns p in s such that

1. Lp = {1, 5},

2. Lp = {1, 5, 9},

3. Lp = {1, 5, 9, 12},

4. Lp = {2, 6},

5. Lp = {2, 6, 10},

6. Lp = {3, 7},

7. Lp = {4, 8}.

Exercise 52 Consider the following input sequence of length 2ℓ + 1.

s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|

1. Is it true that every nontrivial maximal pattern in s3 has an occurrence


that begins at location 1 in s3 ? Why?

2. Is it true that every nontrivial maximal pattern in s3 has an occurrence


that ends at location 2ℓ + 1 in s3 ? Why?
3. Show that there is no nontrivial maximal pattern of size ℓ + 1.

4. Show that there is no nontrivial maximal pattern with only one ‘.’ char-
acter.

5. Show that every nontrivial maximal pattern with a nonzero number of


‘.’ character is of length ≥ ℓ + 1.

Exercise 53 Consider the input string s3 of Section 6.4.2. Let the patterns
satisfy quorum k. Show that the number of quorum-constrained maximal mo-
tifs is at least:

    (ℓ − k + 1) + Σ_{j=k}^{ℓ−3} C(ℓ − 1, j).

Hint: The number of patterns without dont cares that meet the quorum con-
straint is ℓ − k + 1.

Exercise 54 Consider s3 of length 2ℓ + 1, and s4 of length 4ℓ + 1, as follows.

s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|

s4 = c a c a c a · · · c a c a c a g c a c a c a · · · c a c a c a
|←− 2ℓ −→| |←− 2ℓ −→|
Further, given p, p′ is constructed from p by replacing each element

x ∈ Σ + ‘.′

of p with
x a.
Then show that the following statements hold.

1. If p is pattern in s3 , then p′ is a pattern in s4 . Is the converse true?


Why?
2. p1 is a maximal pattern in s3 and p′1 is a maximal pattern in s4 where

       |←−−−−−− ℓ+2 −−−−−−→|
p1  =  c . . c c · · · c c c . c

       |←−−−−−−−−−− 2ℓ+4 −−−−−−−−−−→|
p′1 =  c a . a . a c a c a · · · c a c a c a . a c a

3. p2 is a maximal pattern in s3 and p′2 is a maximal pattern in s4 where

       |←−− ℓ−1 −−→|         |←−−− ℓ −−−→|
p2  =  c c c c · · · c c c c  . . . . . · · · . . . . . c

       |←−−−− 2ℓ−2 −−−−→|                 |←−−−−− 2ℓ −−−−−→|
p′2 =  c a c a c a · · · c a c a c a  . a . a . a · · · . a . a . a  c a

4. If p is maximal in s3 , then p′ is maximal in s4 .

5. The number of maximal patterns in s4 is

O(2^n).

Also, is the converse of statement 1 true? Why?

Hint: 5. Use the fact that the number of maximal patterns in s3 is exponen-
tial.

Exercise 55 (Large-|Σ| example) See Section 6.4.5 for the construction


of input string s5 and definition of D, q(D) and pD . Assume in the following

D 6= ∅.

1. Show that the number of distinct D’s is

    O(2^(ℓ−1)).

2. Show that if D1 ≠ D2 , then q(D1 ) ≠ q(D2 ).

3. For each D, there exists a motif pD in s5 which has a substring q(D),


i.e. for some r,
pD [r . . . r + ℓ] = q(D).
Show that pD is maximal.
Further show that pD is unique, for each D.

4. Following the convention that, i ∈ Lp , is the leftmost position (index)


in s5 of an occurrence of p, prove each of the following statements.

(a)
1 ∈ LpD .

(b) For each j ∈ D,

(ℓ + 1)(j − 1) + (j + 1) ∈ LpD .

(c)
|LpD | = |D| + 1.

Hint: What are the location lists for each D?

Exercise 56 1. Let s8 of length 2ℓ + 4 be defined as:

s8 = c c c · · · c c c a c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|

Enumerate the maximal patterns in s8 .

2. Let s9 of length 3ℓ + 2 be defined as:

s9 = c c c · · · c c c a c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→| |←− ℓ −→|

Enumerate the maximal patterns in s9 .

Hint: What are patterns with 0,1,2, .... dont care elements? Since the
patterns are maximal, consider the autocorrelations of the input sequences.

Exercise 57 (Real sequences) Consider, s, a sequence on reals. Let

0 ≤ δ1 ≤ δ2 .

Let si be the sequence of multi-sets constructed as described in Section 6.6.2,


for δi , i = 1, 2.

1. Let si be defined on Σi , i = 1, 2. Compare the sizes of Σ1 and Σ2 .

2. Let ni be the number of positions in si that are singletons, for i = 1, 2.


How do n1 and n2 compare?

3. Show that if
[σ1 , σ2 , . . . , σm ]
is a homologous set resulting from the construction with the mappings:

σ1 ↔ (l1 , u1 )
σ2 ↔ (l2 , u2 )
..
.
σm ↔ (lm , um )

Then the following holds:

u1 = l 2 ,
u2 = l 3 ,
..
.
um−1 = lm .

In other words, the homologous set is mapped to the (closed) interval:

[σ1 , σ2 , . . . , σm ] ↔ (l1 , um ).

Hint: See Figure 6.8.

Exercise 58 (Real sequences) Let δ = 0.5 and

sr = 0.75 0.80 0.85 0.90 0.95 1.00

Construct sd on a finite alphabet set, based on sr , such that patterns on sr


correspond to patterns on sd .
Hint: Note that the successive intervals on the real intersect at a point.

[a b] [b c d] [d c e] [e f g] [g h i] [i j].

Exercise 59 (Relation  and location lists Lp ) Let pi , i = 0, 1, 2, . . . , l


be patterns on s.
1.
If p1 ֒→ p2 , then p2 ⪯ p1 .

2.
If {p1 , p2 , . . . , pl } ֒→ p0 , then p0 ⪯ pi , i = 1, 2, . . . , l.
Prove the following.


(a) Statement (1) for solid patterns.
(b) Statements (1) and (2) for rigid patterns.
(c) Statements (1) and (2) for extensible patterns.
Do the above statements hold for density-constrained and quorum-constrained
patterns? Why?

Hint: How are the location lists related?

Exercise 60 (Redundant pattern construction) For some input s, let


p1 , p2 , p3 , p4 be maximal motifs in s.

p1 = GT TGGA
p2 = GG GTGGACCC
p3 = GT . GACC
p4 = GT TGAC

Four meet operations with the alignments are shown below (see Section 6.4.6):

q1 = p2 ⊗ p3 ⊗ p4
  p2  G G G T G G A C C C
  p3      G T . G A C C
  p4      G T T G A C
  q1  . . G T . G A C . .

q2 = p2 ⊗ p3
  p2  G G G T G G A C C C
  p3      G T . G A C C
  q2  . . G T . G A C C .

q3 = p1 ⊗ p2 ⊗ p3 ⊗ p4
  p1    G T T G G A
  p2  G G G T G G A C C C
  p3      G T . G A C C
  p4      G T T G A C
  q3  . . . T . G A . . .

q4 = p1 ⊗ p2 ⊗ p3 ⊗ p4
  p1      G T T G G A
  p2  G G G T G G A C C C
  p3      G T . G A C C
  p4      G T T G A C
  q4  . . G T . G . . . .

1. Are q1 and q2 maximal in s? Why?

2. Are q3 and q4 maximal in s? Why?

3. Can Lqi , 1 ≤ i ≤ 4, be computed using Lpi , 1 ≤ i ≤ 4? Why?

Hint: 2. & 3. Use proof by contradiction. 4. For qi , what if there exists
some other maximal p5 such that qi ⪯ p5 ?

Exercise 61 (Proof of linearity of basis) Let s be of length n.


For 1 < i < n, let

ari = s ⊗ s, for alignment (1, i).

1. Show that ari is a maximal pattern in s, for each i.



2. Show that it is possible that

ari = arj , for i ≠ j.

3. Let Lp represent the leftmost position of the occurrence of p in s.


(a) Let
Lp = {i1 < i2 }.
Then show the following.
i. p is a substring of ari2 −i1 +1 .
ii. If p is maximal then p = ari2 −i1 +1 .
(b) Let
Lp = {i1 < i2 < . . . < ik }.
If p is maximal then show that

p = ari2 ⊗ ari3 ⊗ . . . ⊗ arik , for alignment (1, 1, . . . , 1).

4. Show that
Pbasis = {ari | 1 < i < n}.

Hint: 1. Use proof by contradiction. 2. See Exercise 62. 3. Follows from the
definition of maximality and the meet operator ⊗. Show the following.

p = ari2 −i1 +1
= s ⊗ s, for alignment (1, i2 − i1 + 1).

4. Show the following:

p = s ⊗ s ⊗ . . . ⊗ s,
for alignment (1, i2 − i1 + 1, i3 − i1 + 1, . . . , ik − i1 + 1)
= ari2 ⊗ ari3 ⊗ . . . ⊗ arik ,
for alignment (1, 1, . . . , 1).

Exercise 62 Let s3 of length 2ℓ + 1 be defined as

s3 = c c c · · · c c c a c c c · · · c c c
|←− ℓ −→| |←− ℓ −→|

Then show the following:


1. arℓ+1 = arℓ+2 .

2. |Pbasis | = 2ℓ − 2.
Hint: s3 ’s basis is:
c c
c c c
c c cc
..
.
c c c- - cc
c c c- - cc c
c c c- - cc c c
c . cc c - - - c.c
c c .c c - - - c.c c
c c c. c - - - c.c c c
..
.
c c c- -c. c c.c c ---c
c c c- -cc . c.c c ---cc
c c c- -cc c ..c c ---cc c
←− ℓ − 1 −→ 2 ←− ℓ − 1 −→

Exercise 63 (Basis construction algorithm)


1. Design an algorithm to compute the basis for a string, based on Theo-
rem (6.3).
2. What is the time complexity of the algorithm?
Hint: Each ari can be constructed in O(n) time, even under density con-
straint d. Thus the basis can be constructed in time

O(n^2)

See [AP04] for an efficient incremental algorithm. See also [PCGS05] for a
nice exposition.

Exercise 64 (Multiple occurrences at i) Let

s = A T C G A T A.

What are the occurrences, with the left most position i = 1 in s, of

p=A.C−A?

Hint: The two occurrences of the extensible pattern at position 1 in s are:

ATCGA TA
ATCGATA

Exercise 65 (Trivial extensible patterns) Given an input s, let p be a


rigid pattern with
|Lp | > 2.

1. Construct an extensible pattern p′ that must occur in s with

|Lp′ | ≥ 2.

2. How many extensible patterns can be constructed based on p?

Hint: Let
Lp = {i1 , i2 , . . . , ik−1 , ik }.
1. Let
p′ = p−p
Then
Lp′ = {i1 , i2 , . . . , ik−1 }.
2. Let
p′ = p−p−p
Then
Lp′ = {i1 , i2 , . . . , ik−2 }.
And, so on.

Exercise 66 (Nontrivial extensible patterns) Consider the following ex-


tensible pattern
p = p1 −p2 − . . . −pl
where each pi , 1 ≤ i ≤ l, is rigid. Under what conditions is p nontrivial,
i.e., such that it cannot be simply constructed from pi and Lpi , 1 ≤ i ≤ l ?

Hint: Let the quorum be k(≥ 2). If some pi is such that it occurs less than
k times and this pi is used in more than one occurrence of p.

Exercise 67 Given two extensible patterns p1 and p2 , that co-occur at a po-


sition i in s, devise a method to compute

p = p1 ⊗ p2 , for alignment (i1 , i2 ).

Hint:
p1 = A G A − C T A A − A . G − A
p2 = G − − − T G A A A − − A A C . G
p= G−−−T . A−A−−−A

Exercise 68 (Invariance under reversal) The reverse of a string s is


denoted as s̄. For example, if

s = A C G G T T C,

then
s̄ = C T T G G C A.

1. If p occurs at j in s, let the center of occurrence, jc , be given as

|p | −1
jc = j + .
2
Note that jc may not always be an integer, but that does not matter. We
follow the convention that, i ∈ Lp , is the center of the occurrence of p
in s. If p is a pattern in s, then show the following:

(a) p̄ is a pattern in s̄, and


(b) Lp = Lp̄ .

2. If p is maximal (nonmaximal) in s, then show that p̄ is maximal (non-


maximal) in s̄.

3. If p is irredundant (redundant) in s, then show that p̄ is irredundant


(redundant) in s̄.

4. Show the following:

|P (s)| = |P (s̄)|,
|Pmaximal (s)| = |Pmaximal (s̄)|,
|Pbasis (s)| = |Pbasis (s̄)|.

Comments
String patterns are about the simplest idea, in terms of their definition, in
bioinformatics. It is humbling to realize how complicated the implications of
simplicity can be. This area has gained a lot from a vibrant field of research
in computer science, called stringology (not to be confused with string theory,
from high energy physics).
Chapter 7
Algorithms & Pattern Statistics

Intuition without science


is like tango without a partner.
- anonymous

7.1 Introduction
In the previous chapter, we described a whole array of possible character-
izations of patterns, starting from the simple l-mer (solid patterns) to rigid
patterns with dont care characters to extensible patterns with variable length
gaps. Further, the element of a pattern could be drawn from homologous sets
(multi-sets). In this chapter we take these intuitive definitions to fruition by
designing practical discovery algorithms and devising measures to evaluate
the significance of the results.

7.2 Discovery Algorithm


We describe an algorithm [ACP05] for a very general form of pattern:

extensible pattern on multi-sets (homologous sets)


with density constraint d and quorum constraint k.

Input: The input is a string s of size n and two positive integers, density
constraint d and quorum k > 1.
Output: The density (or extensibility) parameter d is interpreted as the
maximum size of the gap between two consecutive solid characters in a pat-
tern. The output is all maximal extensible patterns that occur at least k times
in s.
The algorithm can be adapted to extract rigid motifs as a special case.
For this, it suffices to interpret d as the maximum number of dot characters
between two consecutive solid characters.


Algorithm: The algorithm works by converting the input into a sequence


of possibly overlapping cells. A maximal extensible pattern is a sequence of
(overlapping) cells. Given a pattern p defined on

Σ + ‘.’ + ‘-’,

a substring p̂ of p is a cell, also denoted by a triplet

⟨σ1 , σ2 , ℓ⟩ ,

defined as follows:
1. p̂ begins in σ1 and ends in σ2 where

σ1 , σ2 ⊆ Σ.

Note that σi , i = 1, 2, is a set of solid characters (homologous set),


possibly singleton.
2. p̂ has only nonsolid intervening characters.
ℓ is the number of intervening dot characters ‘.’.
If the intervening character is the extensible character, ‘-’, then ℓ takes
a value of -1.
C(p) is the collection of all cells of p. For example, if

p = A G . . C − T . [G C]

then the cells of p, C(p) are:

p̂ hσ1 , σ2 , ℓi
AG hA, G, 0i
G.. C hG, C, 2i
C−T hC, T, −1i
T . [G C] hT, [G C], 1i

This will be also used later in the probability computations of the patterns.
Initialization Phase: The cell is the smallest extensible component of a
maximal pattern and the string can be viewed as a sequence of overlapping
cells. The initialization phase has the following steps.
Step 1: Construct patterns that have exactly two solid characters in them
and separated by no more than d spaces or ‘.’ characters. This is done by
scanning the string s from left to right. Further, for each location, the start
and end position of the cell are also stored.
Step 2: The extensible cells are constructed by combining all the cells with
at least one dot character and the same start and end solid characters. The
location list is updated to reflect the start and end position of each occurrence.
If B is the collection of all such cells in an input s, then it can be verified
that
|B| ≤ (2 + d) |Σ|^2.
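Steps 1 and 2 can be sketched as follows in Python (an illustration under
simplifying assumptions: single characters only, no homologous sets, and the
quorum counted on (start, end) pairs).

# Initialization: rigid cells with 0..d intervening positions, plus merged
# extensible cells (l = -1), each with its (start, end) occurrence list.
from collections import defaultdict

def initial_cells(s, d, k):
    rigid = defaultdict(list)                  # (s1, s2, l) -> [(start, end), ...]
    for i, a in enumerate(s):
        for l in range(0, d + 1):              # l intervening '.' characters
            j = i + l + 1
            if j < len(s):
                rigid[(a, s[j], l)].append((i + 1, j + 1))
    cells = dict(rigid)
    extensible = defaultdict(list)             # (s1, s2, -1): the '-' cell
    for (a, b, l), locs in rigid.items():
        if l >= 1:
            extensible[(a, b, -1)].extend(locs)
    cells.update(extensible)
    # keep only the cells that meet the quorum k
    return {c: sorted(L) for c, L in cells.items() if len(L) >= k}

B = initial_cells("CAGCAGTCTC", d=2, k=2)
print(B[('A', 'G', 0)])    # [(2, 3), (5, 6)]
print(B[('A', 'C', -1)])   # [(2, 4), (5, 8)]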
Define the following order relation, where σ ∈ Σ,
‘−’ ≺ ‘.’ ≺ σ
The cells in B are arranged in descending lexicographic order. For two solid
characters, we can arbitrarily pick any order without affecting the results.
See the concrete example below for an illustration. The idea is that cells are
processed in the order of ‘saturation’, which roughly means maximal in com-
position and extensibility (but not necessarily in length). Thus, for example
A C ≻ A . T
    ≻ A . . G
    ≻ A . . . C
    ≻ A − C.
Similarly,
C A ≻ A . A
    ≻ G . . A
    ≻ C . . . A
    ≻ C − A.
Iteration Phase: The algorithm works by starting with a pattern in B,
and iteratively using compatible cells in B to generate an extended pattern.
The process is repeated until this pattern can no longer be extended. This
‘maximally’ extended pattern is emitted and then we move on to the next cell
in B.
However, once a pattern p is emitted, it cannot be withdrawn. Hence it is
required to make sure that p is maximal and there can be no other p′ that can
be generated later to make p nonmaximal w.r.t. p′ . How can this condition
be ensured?
This is done by processing the cells in a decreasing order of saturation. For
p1 , p2 ∈ B,
p2 can possibly be added to the right of p1 if the last character of p1 is the same
as the first character of p2 . We say that p1 is compatible with p2 . Further,
p = p1 ⊕ p2 , when p1 is compatible with p2 ,
where p is the concatenation of p1 and p2 with an overlap at the common end
and start character. For example, if
p1 = C . G
p2 = G . . T

then p1 is compatible with p2 and

p = p1 ⊕ p2
= C. G . . T

Note that p2 is not compatible with p1 . Also, the location list of p is appro-
priately generated as follows:

Lp = {(i, j) | (i, l) ∈ Lp1 , (l, j) ∈ Lp2 }.
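The ⊕ operation and the corresponding join of the location lists can be sketched
as follows (Python, illustrative only; the location lists passed in are those of
the running example of the next section).

# p1 ⊕ p2: overlap on the shared end/start character, join the location lists.
def compatible(p1, p2):
    return p1[-1] == p2[0]

def concat(p1, L1, p2, L2):
    assert compatible(p1, p2)
    p = p1 + p2[1:]                                  # overlap on the common character
    L = [(i, j) for (i, l) in L1 for (l2, j) in L2 if l == l2]
    return p, L

p, L = concat("AG", [(2, 3), (5, 6)], "G-C", [(3, 4), (6, 8)])
print(p, L)   # AG-C [(2, 4), (5, 8)]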

The algorithm is straightforward and is explained through a concrete ex-


ample. The two points to bear in mind are as follows.

1. Amongst all possible candidate cells, always pick the most saturated
one (at each step). This ensures that the patterns are generated in the
desirable order: If p′ is nonmaximal w.r.t. a maximal pattern p, then p
is always generated (emitted) before p′ .

2. A pattern under construction is first extended to the right, till it can


be right-extended no more. Then it is extended to the left till it can be
left-extended no more. Then it is ready to be emitted. Before emitting,
it is checked against the emitted patterns to see if it is nonmaximal.

The overall algorithm could either be simply iterative or recursive (to take
advantage of partial computations). We describe a recursive version below
(note that the ‘backtrack’ in the discussion can be implicitly captured by
recursive calls).
The details of the algorithm are left as an exercise for the reader (Exer-
cise 70) which can be gleaned from the concrete example discussed below.
The reader is also directed to [ACP05] for other details.

Concrete example. For example, let

density constraint d = 2 and quorum k = 2,

with input string

s= C A G C A G T C T C.

In Steps 1 and 2 of the initialization, the set of cells B is generated. Blef t
shows the elements of B sorted by the starting element of the cell, and
each column follows the ≻ ordering of the cells. Similarly, Bright shows the
elements of B sorted by the last element of the cell, and the elements of
each column are similarly ordered. Thus each column shows the ‘saturation’
ordering of the cells and the cells are processed in the order displayed here.

Blef t =  { A G,       C A,       G C,       T C,
            A . C,     C T,       G T,       T . T,
            A . T,     C . G,     G . A,     T . . T,
            A . . A,   C . C,     G . C,     T − T
            A . . C,   C . . C,   G . . G,
            A − C,     C . . T,   G . . T,
                       C − C,     G − C,
                       C − T,     G − T            },

Bright =  { C A,       G C,       A G,       G T,
            G . A,     T C,       C . G,     C T,
            A . . A,   A . C,     G . . G,   A . T,
                       C . C,                T . T,
                       G . C,                C . . T,
                       C . . C,              G . . T,
                       A . . C,              C − T,
                       T . . C,              G − T,
                       A − C,                T − T
                       C − C,
                       G − C                        }.

To avoid clutter, the cells that do not meet the quorum constraints have been
removed to produce Blef t and Bright . See Exercise 69 (1) for a mild warning
about this step.

Blef t =  { A G,     C A,     G − C,   T C,
            A − C,   C . G,   G − T,
                     C − C,
                     C − T            },

Bright =  { C A,     T C,     A G,     C − T,
            A − C,   C . G,            G − T
                     C − C,
                     G − C                    }.

Note that

(i, j) ∈ L

denotes that the cell begins at position i and ends at position j in the input.
Again, to avoid clutter, we do not enumerate all the location lists of the cells.

We show only a few examples of cells with their location lists below.

LC A = {(1, 2), (4, 5)},


LA G = {(2, 3), (5, 6)},
LA − C = {(2, 4), (5, 8)},

We now show the steps involved in constructing the maximal extensible pat-
terns.
1. (Pick cell in order of saturation) Consider the cells that start with
A and have at least two occurrences. Then we have the following:
p1 = A G
p2 = A − C
What should the first choice be? Between the two, pattern p1 is more
‘saturated’ than p2 and p1 is picked first.

p1 = A G

2. (Explore right) Next, we pick cells that are compatible with p1 . We


look in Blef t for cells that start with G.
p3 = G − C,
p4 = G − T,

We pick p3 .

q1 = p1 ⊕ p3
= A G − C , and
Lq1 = {(2, 4), (5, 8)}.

3. (Explore right) We continue to explore the right and search for cells,
p, such that q1 is compatible with p. We look in Blef t for cells that start
with C.
p5 = C A
p6 = C . G
p7 = C − C
p8 = C − T
Adding p5 , p6 , p7 does not meet the quorum requirements. The only
option is p8 .

q2 = q1 ⊕ p8
= AG−C−T , and
Lq2 = {(2, 7), (5, 9)}.

4. (Explore right) We continue to explore the right and search for cells,
p, such that q2 is compatible with p. We look in Blef t for cells that start
with T.
p9 = T C

q3 = q2 ⊕ p9
= AG−C−TC , and
L q3 = {(2, 8), (5, 10)}.

5. (Explore left) No more cells can be added to the right in q3 . Now we


try to add cells to the left of q3 and look for cells p such that p and q3
are compatible. We look in Bright for cells that end with A.

p5 = C A

Thus

q4 = p5 ⊕ q3
= CAG−C−TC , and
L q4 = {(1, 8), (4, 10)}.

6. (Emit maximal pattern) However, no more cells can be added to the


right or to the left of q4 , hence emitted.

q4 = C A G − C − T C

7. (Backtrack) We can backtrack to utilize some partial computations


(using the recursive call mechanism). As we backtrack, step 5 had only
one option; step 4 had only one option; the very last option was used in
step 3. However, in step 2, another option can still be explored.

8. (Explore right) We are back in the state of step 2 and at this stage
we wish to extend
p1 = A G
to the right with
p4 = G − T.
But this does not meet the quorum constraint, so we explore the left
now.

9. (Explore left) We look in Bright for cells that end in A.

p5 = C A

Thus

q5 = p5 ⊕ p1
= C A G , and
L q5 = {(1, 3), (4, 6)}.

10. (Explore left) We explore extending q5 to the left without success.

11. (NO emit) Before emitting q5 , it is checked for maximality against the
emitted pattern q4 and it turns out that q5 is nonmaximal w.r.t. q4 ,
hence it cannot be emitted.

12. (Backtrack) We must backtrack to step 1. In step 1, we can now use


the second choice and use cell p2 :

p2 = A − C

13. (Explore right) We explore the right and search for cells, p, such that
p2 is compatible with p :

p5 =C A
p6 =C .G
p7 =C −C
p8 =C −T

Adding p5 , p6 , p7 does not meet the quorum requirements. The only


option is p8 .

q6 = p2 ⊕ p8
= A−C−T , and
L q6 = {(2, 7), (5, 9)}.

14. (Explore right) We continue to explore the right and search for cells,
p, such that q6 is compatible with p:

p9 = T C

Adding p9 does not meet the quorum requirements. No more cells can
be added to the right in q6 .

15. (Explore left) Now we try to add cells to the left of q6 and look for
cells p such that p and q6 are compatible:

p5 = C A

Thus
q7 = p5 ⊕ q6
= CA−C−T , and
Lq7 = {(1, 7), (4, 9)}.
However, no more cells can be added to the right or to the left of q7 .
16. (Emit maximal pattern) q7 is checked against q4 which was emitted
before. q7 is not nonmaximal w.r.t. q4 , hence emitted.

q7 = C A − C − T

17. (Repeat iteration) In fact, we should repeat the whole process with
cells starting with C, G and T as well. But we skip those details here.
See Exercises 70 and 71 for other details on the algorithm.

7.3 Pattern Statistics


We have seen in the last sections that the specification of a pattern can be
very flexible, resulting in a large number of legitimate patterns in an input.
Can we assign a level of statistical significance to these patterns?
In Chapter 5 we have seen that biopolymers can be modeled in various
ways. In the following sections, we discuss the probabilities of the occurrence
of the various classes of string patterns under these different models (i.i.d.
and the Markov model).

7.4 Rigid Patterns


The discussion here can be easily adapted for solid patterns as a special
case.
We begin our treatment by deriving some simple expressions for the prob-
ability, prq , of a rigid pattern q over
Σ + ‘.’
Let q be a rigid pattern generated by an i.i.d. source (see Chapter 5) which
emits σ ∈ Σ with probability
prσ .
Note that

    Σ_{σ} prσ = 1.
Let the number of times σ appears in q be given by
kσ .
Then probability of occurrence of q, prq , is given as

    prq = Π_{σ∈Σ} (prσ )^{kσ} .        (7.1)

Thus, the dot character implicitly has a probability of 1. This fact alone
can raise some debate about the model, but we postpone this discussion to
Section 7.5.3.
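Equation (7.1) translates directly into code. The following Python sketch
(illustrative; the character probabilities are assumed values) multiplies the
probabilities of the solid characters only.

# Probability of a rigid pattern under the i.i.d. model; '.' contributes 1.
def pr_rigid(q, pr):
    prob = 1.0
    for c in q:
        if c != '.':
            prob *= pr[c]
    return prob

pr = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}   # assumed i.i.d. probabilities
print(pr_rigid("A.C..G", pr))                    # 0.3 * 0.2 * 0.2 = 0.012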

Markov Chain
Next, we obtain the form of prq for a pattern q when the input is assumed to
be generated by a Markov chain (see Chapter 5). For the derivation below,
we assume the Markov chain has order 1. Let

    pr^(k)_{σ1,σ2}

denote the probability of moving from σ1 to σ2 in k steps.


Let s be a stationary, irreducible, aperiodic Markov chain of order 1 with
state space Σ (|Σ| < ∞). Further,
πσ
is the equilibrium (stationary) probability of σ ∈ Σ and the
(|Σ| × |Σ|)
transition probability matrix is as follows:
    P [i, j] = pr^(1)_{σi,σj} .
Recall the definition of a cell from Section 7.2. For a rigid pattern q, each
cell
⟨σ1 , σ2 , ℓ⟩ ∈ C(q)
is such that ℓ ≥ 0. It is easy to see that when ℓ ≥ 0, the cell represents the
(ℓ+1)-step transition probability given by P^{ℓ+1}, i.e.,

    pr_{σ1 (·)^ℓ σ2} = P^{ℓ+1} [σ1 , σ2 ].

Thus for a rigid pattern q ′ ,

    prq′ = π_{q′[1]}  Π_{⟨σ1 ,σ2 ,ℓ⟩∈C(q′)} P^{ℓ+1} [σ1 , σ2 ] .        (7.2)
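Equation (7.2) can be evaluated with matrix powers. The following Python
sketch (illustrative; the two-letter chain and its stationary distribution are
assumed values) walks over the cells of a rigid pattern and multiplies the
corresponding (ℓ+1)-step transition probabilities.

# Probability of a rigid pattern under a first-order Markov chain.
import numpy as np

def pr_rigid_markov(q, pi, P, index):
    prob = pi[index[q[0]]]
    i = 0
    while i < len(q) - 1:
        j = i + 1
        while q[j] == '.':                 # count the intervening dot characters
            j += 1
        steps = j - i                      # l dots give l + 1 transitions
        prob *= np.linalg.matrix_power(P, steps)[index[q[i]], index[q[j]]]
        i = j
    return prob

index = {'A': 0, 'C': 1}                   # toy two-letter example (assumed values)
P = np.array([[0.6, 0.4],
              [0.5, 0.5]])
pi = np.array([5/9, 4/9])                  # stationary distribution of P
print(pr_rigid_markov("A.C", pi, P, index))   # pi_A * P^2[A, C] ≈ 0.244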

7.5 Extensible Patterns


Let q be an extensible pattern with density constraint d, i.e., the extensible
gap character ‘-’ is to be interpreted as upto d dot ‘.’ characters. R(q) is the
set of all possible realizations of q and each realization is a rigid pattern. For
example if
q = A . C − G,
and d = 4, the realizations of q are:
R(q) = {q0 , q1 , q2 , q3 , q4 },
where
q0 = A . C G
q1 = A . C .G
q2 = A . C .. G
q3 = A . C .. . G
q4 = A . C . . . . G

Extensible patterns display various characteristics that make the probabil-
ity computation nontrivial. See Exercise 81 for an illustration. Let
R(q) = {q1 , q2 , . . . , qr }.
Note that
prq = prR(q) = prq1 +q2 +...+qr ,
where prq1 +q2 +...+qr is the probability of occurrence of q1 or q2 or . . . or qr .
where prq1 +q2 +...+qr is the probability of occurrence of q1 or q2 or . . . qr .
If no two qi ’s in R(q) co-occur, then we can simply add the individual
probabilities of qi ’s. But, unfortunately, it is possible that they can co-occur.
In other words, it is possible for an extensible pattern to occur multiple times
at the same location i. Continuing the example, q occurs three times (as q1 ,
q2 and q3 ) at position 1 of s as shown below:
s        =  A G C A G G G
q1 on s:    A G C A G
q2 on s:    A G C A G G
q3 on s:    A G C A G G G

So, we consider two kinds of extensible patterns:



1. (nondegenerate) Extensible patterns that can occur only once at a po-


sition in the input.
2. (degenerate) Extensible patterns that may occur multiple times at a
position in the input.

7.5.1 Nondegenerate extensible patterns


Here we discuss nondegenerate extensible patterns, or the ones that occur
only once at a site. Then

    prq = Σ_{qj ∈R(q)} prqj .

Hence we need to compute prqj where qj is a rigid pattern. Since the dot
characters of a rigid pattern contribute a factor of 1, from Equation (7.1),

    prqj = Π_{σ∈Σ} (prσ )^{kσ} ,

where σ appears kσ times in qj . In other words, only the solid characters
contribute nontrivially to the computation of prqj . Hence, if q is not rigid,

    prq = Σ_{qj ∈R(q)} prqj
        = Σ_{qj ∈R(q)} Π_{σ∈Σ} (prσ )^{kσ}
        = |R(q)| Π_{σ∈Σ} (prσ )^{kσ} .

Annotated gaps. Sometimes, the extensibility of an extensible pattern


may be represented, not by d but by a special annotation, say α, for a gap
character. Note that d always represents the following possible dot characters

{1, 2, . . . , d}.

It is possible to represent an arbitrary collection of gaps, say such as

α = {2, 4, 5, 7}.

Let the number of extensible gaps be e each with annotation (set)

αi , 1 ≤ i ≤ e.

For example an extensible pattern q could be written as

q = A .{1,2} C .{2,3} G,

where the annotation sets are


α1 = {1, 2},
α2 = {2, 3}.
All the rigid realizations of q are:
q1 = A . C . . G
q2 = A . C . . . G
q3 = A . . C . . G
q4 = A . . C . . . G
Thus the possible rigid realizations of q are all possible combinations
(l1 , l2 , . . . , le ),
where each li ∈ αi , and the total number of rigid realizations is given by:

    Π_{i=1}^{e} |αi |.

Then

    prq = ( Π_{σ∈Σ} (prσ )^{kσ} ) ( Π_{i=1}^{e} |αi | ).        (7.3)

But

    |R(q)| = Π_{i=1}^{e} |αi |.

Hence,

    prq = |R(q)| Π_{σ∈Σ} (prσ )^{kσ} .        (7.4)

Multi-sets (homologous sets). Consider the case where a solid character,
q[i], of the pattern is a set of homologous characters. Then, since only one of
the homologous characters may appear at an occurrence, the probability of
occurrence, prq[i] , of q[i] is given by

    prq[i] = Σ_{σ∈q[i]} prσ .        (7.5)

Thus, if q is a nondegenerate extensible pattern on homologous sets, using
Equations (7.4) and (7.5), its probability of occurrence is given by

    prq = |R(q)|  Π_{q[i]≠‘.’,‘-’} ( Σ_{σ∈q[i]} prσ ) .        (7.6)
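Equation (7.6) can be sketched as follows (Python, illustrative only; the
pattern is given as a list with a string of characters per solid position, and
each ‘-’ is taken to range over 1 to d dots, so it contributes a factor d to
|R(q)|; with a gap that may also be empty the factor becomes d + 1).

# Probability of a nondegenerate extensible pattern over homologous sets (i.i.d.).
def pr_extensible(q, pr, d):
    n_real = 1
    prob = 1.0
    for x in q:
        if x == '-':
            n_real *= d                      # each '-' realized as 1..d dots
        elif x != '.':
            prob *= sum(pr[c] for c in x)    # homologous set: sum of probabilities
    return n_real * prob

pr = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}   # assumed i.i.d. probabilities
q = ['A', '.', 'GC', '-', 'T']                   # A . [G C] - T
print(pr_extensible(q, pr, d=2))                 # 2 * 0.3 * (0.2 + 0.2) * 0.3 = 0.072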

Markov chain
If q is a nondegenerate extensible pattern then,

    prq = Σ_{q′ ∈R(q)} prq′ .        (7.7)

Using Equations (7.2) and (7.7), for a nondegenerate extensible pattern q,
using the Markov chain model, we have

    prq = Σ_{q′ ∈R(q)}  π_{q[1]}  ( Π_{⟨σ1 ,σ2 ,ℓ⟩∈C(q′)} P^{ℓ+1} [σ1 , σ2 ] ).        (7.8)

When sets of characters or homologous sets are used in patterns, the cell
is appropriately defined so that σ1 and σ2 are sets of homologous characters,
possibly singletons. Then the following holds:

    prq = ( Σ_{σ∈q[1]} πσ )  Σ_{q′ ∈R(q)}  Π_{⟨σ1 ,σ2 ,ℓ⟩∈C(q′)} ( Σ_{σa ∈σ1 , σb ∈σ2} P^{ℓ+1} [σa , σb ] )        (7.9)

7.5.2 Degenerate extensible patterns


Next we consider the scenario where the patterns possibly occur multiple
times at a position i in the input. The following discussion holds for both the
i.i.d. and the Markov model.
Let M l (q) denote a set of strings that has only the solid characters of at
least l occurrences of the pattern q at position one of the string. Here we
display every other character (not solid for pattern q) as 2. For example,
consider the pattern
q = C − G.
with
R(q) = {C . G, C . . G, C . . . G}.
q occurs once on each q ′ ∈ M 1 (q), where,
M 1 (q) = { C 2 G,
C 2 2 G,
C 2 2 2 G}.
q occurs twice on each q ′ ∈ M 2 (q), where
M 2 (q) = { C 2 G G,
C 2 2 G G,
C 2 G 2 G}.

q occurs three times on q ′ ∈ M 3 (q), where

M 3 (q) = { C 2 G G G}.

We will compute the probability of q′ ∈ M^l(q), for each l, where q′ is treated like a
rigid pattern and the probability of occurrence of the 2 character is assumed
to be 1 (like the dot character). The probability of the occurrence of the set
M^l(q), pr(M^l(q)), is given by

    pr(M^l(q)) = \sum_{q′ ∈ M^l(q)} pr_{q′}.

Let q be a degenerate (possibly with multiple occurrences at a site) exten-


sible pattern and
|R(q)| = r.
Then, using the inclusion-exclusion principle (see Theorem (3.2) in Chapter 3),

    pr_q = pr(M^1(q)) − pr(M^2(q)) + . . . + (−1)^{r+1} pr(M^r(q))        (7.10)
         = \sum_{k=0}^{r−1} (−1)^k \, pr(M^{k+1}(q)).        (7.11)

Approximating the probability. Notice that for a degenerate pattern,
Equation (7.6) is the zeroth order approximation of Equation (7.11). The
first order approximation is

    pr_q ≈ pr(M^1(q)) − pr(M^2(q)),

and the second order approximation is

    pr_q ≈ pr(M^1(q)) − pr(M^2(q)) + pr(M^3(q)),

and so on.
Using Bonferroni’s inequalities (see Chapter 3), if k is even, then a kth order
approximation of prq is an overestimate of prq (and an underestimate when k is odd).
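A small sketch of these truncated approximations, assuming the set probabilities pr(M^l(q)) have already been computed separately (the numeric values below are hypothetical):

def approx_pr(pr_M, order):
    """k-th order approximation of Equation (7.11).

    pr_M[k] is assumed to hold pr(M^{k+1}(q)); these set probabilities must be
    computed separately by enumerating the strings of each M^l(q) and treating
    them as rigid patterns. The k-th order approximation keeps the first k+1
    terms of the alternating inclusion-exclusion sum."""
    return sum((-1) ** k * pr_M[k] for k in range(order + 1))

pr_M = [0.30, 0.05, 0.004]        # hypothetical pr(M^1), pr(M^2), pr(M^3)
print(approx_pr(pr_M, 0))         # zeroth order (overestimate)
print(approx_pr(pr_M, 1))         # first order (underestimate)
print(approx_pr(pr_M, 2))         # second order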

7.5.3 Correction factor for the dot character


When two patterns q1 and q2 both get a significant score (as defined in
Equation (7.13)), and are very close to each other, it becomes desirable to
discriminate one from the other. This calls for a more careful evaluation of
the score [ACP07].
In the previous definitions, we used the assumption that a pattern is generated
by a single stationary source. This model imposes the restriction that a
mismatch is also produced by that source, whereas in reality it is the
concatenation of a series of events that generates this mismatch. We revisit
the earlier model and refine the treatment of the wild card under the i.i.d.
assumption. The dot character is treated as ‘any’ character emitted by the
source and thus its probability is assigned to be 1. However, in computing the
probability of the leftmost occurrence of a pattern the dot character actually
corresponds to a mismatch. A mismatch occurs when, in comparing two input
sequences at particular positions, the two characters differ. This probability
is computed as the complement of having two independent extractions from an urn
return the same character, hence:

    pr_{dot} = 1 − \sum_{σ∈Σ} pr_σ^2.

Expression (7.3) is corrected as:

    pr_q = \prod_{σ∈Σ} (pr_σ)^{k_σ} \prod_{i=1}^{e} \left( |α_i| \, pr_{dot}^{α_i} \right)        (7.12)

Using
    pr_{dot} < 1,
instead of
    pr_{dot} = 1,
could be interpreted as a probabilistic way to include a “gap penalty” in
the previous formulation.
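The corrected score can be sketched as follows; note that the exponent applied to pr_dot for an annotated gap is an assumption made here (the mean annotated gap length), since the text leaves that detail to [ACP07].

def pr_dot(pr):
    """Mismatch probability for the dot character: 1 - sum of pr_sigma^2."""
    return 1.0 - sum(p * p for p in pr.values())

def corrected_pr(k, alphas, pr):
    """Rough sketch of Expression (7.12): k maps each solid character to its
    count k_sigma, alphas is the list of gap annotation sets alpha_i."""
    p = 1.0
    for sigma, count in k.items():
        p *= pr[sigma] ** count
    d = pr_dot(pr)
    for alpha in alphas:
        mean_gap = sum(alpha) / len(alpha)   # assumption: use the mean gap length
        p *= len(alpha) * d ** mean_gap
    return p

pr = {b: 0.25 for b in "ACGT"}
print(pr_dot(pr))                                            # 0.75 for uniform DNA
print(corrected_pr({"A": 1, "C": 1, "G": 1}, [{1, 2}, {2, 3}], pr))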

7.6 Measure of Surprise


Irrespective of the particular model or representation chosen, the tenet of
pattern discovery equates overrepresentation (or underrepresentation) of a
motif with surprise, hence with interest. Thus, any motif discovery algorithm
must ultimately weigh motifs against some threshold, based on a score that
compares empirical and expected frequency, perhaps with some normalization.
The departure of a pattern q from expectation is commonly measured by the
so-called z-scores ([LMS96]), which have the form

    z(q) = \frac{f(q) − E(q)}{N(q)},
where
1. f (q) > 0 represents the observed frequency,
2. E(q) > 0 an expectation and

3. N (q) > 0 is a normalization function.


For a given z-score function, set of patterns P , and real positive threshold α
(see Section 3.3.4 for a discussion on threshold α), patterns such that

z(q) > α

or
z(q) < −α
are respectively overrepresented or underrepresented, or simply surprising.

7.6.1 z-score
Let prq be the probability of the pattern q occurring at any location i on
the input string s with
n = |s|
and let kq be the observed number of times it occurs on s. Assuming that the
occurrence of a pattern p at a site is an i.i.d. process, ([Wat95], Chapter 12),
for large n and kq ≪ n,
    \frac{k_q − n \, pr_q}{\sqrt{n \, pr_q (1 − pr_q)}} → Normal(0, 1).
See Chapter 3 for properties of normal distributions. Thus the z-score for a
pattern q is given as
    z(q) = \frac{k_q − n \, pr_q}{\sqrt{n \, pr_q (1 − pr_q)}}.        (7.13)
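A direct implementation of Equation (7.13), with illustrative numbers:

import math

def z_score(k_q, n, pr_q):
    """z-score of Equation (7.13): observed count k_q over n positions,
    with per-position occurrence probability pr_q."""
    expected = n * pr_q
    sd = math.sqrt(n * pr_q * (1.0 - pr_q))
    return (k_q - expected) / sd

# A rigid 4-mer with pr_q = 0.25**4 observed 8 times in a 600-base sequence:
print(z_score(8, 600, 0.25 ** 4))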

7.6.2 χ-square ratio


The z-score is not the only way to measure events that occur with unexpected
frequency.
as, e.g., with protein families, a pattern q is considered to be overrepresented
if a surprisingly large number of sequences from an ensemble contain each at
least one occurrence of q. In this context, a large total number of occurrences
of q in any particular sequence is immaterial and may be misleading as a
measure, since the relevant fact is that the motif is shared across multiple
sequences.
Let prq be the probability assigned to motif q, computed according to any
of the models above. Assuming t sequences

s1 , s2 , ..., st ,

to be given, the expected number of occurrences of the pattern in si is
approximately

    µ_i = pr_q |s_i|.

By the law of rare events (Poisson distribution), the probability of finding q


at least once in si is
pr (i) = 1 − e−µi .
See Chapter 3 for a discussion on Poisson distributions. Then the expected
number of sequences containing q at least once is
    k_e = \sum_{i=1}^{t} pr(i)
        = \sum_{i=1}^{t} \left( 1 − e^{−µ_i} \right)
        = t − \sum_{i=1}^{t} e^{−µ_i}
        = t − \sum_{i=1}^{t} e^{−pr_q |s_i|}.

Let kq be the observed number of sequences that contain q. Then the statisti-
cal significance of a given discrepancy between the observed and the estimated
is assessed by taking the χ-square ratio as follows:

    χ(q) = \frac{(k_q − k_e)^2}{k_e}.
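The following sketch computes k_e with the Poisson approximation above and then the χ-square ratio; the input format (a list of sequence lengths) is a choice made here for brevity.

import math

def chi_square_ratio(seq_lengths, pr_q, k_q):
    """chi-square ratio of Section 7.6.2.

    seq_lengths : list of input sequence lengths |s_i|
    pr_q        : probability of the pattern at a single position
    k_q         : observed number of sequences containing the pattern
    """
    # Expected number of sequences containing the pattern at least once,
    # using the Poisson ("law of rare events") approximation.
    k_e = sum(1.0 - math.exp(-pr_q * n) for n in seq_lengths)
    return (k_q - k_e) ** 2 / k_e

# 20 sequences of length 600, a pattern with pr_q = 0.25**6, seen in 9 of them:
print(chi_square_ratio([600] * 20, 0.25 ** 6, 9))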

7.6.3 Interplay of combinatorics & statistics


Pattern discovery methods use combinatorial checks such as maximality
(and redundancies) and statistical ones such as pattern probabilities as mech-
anisms to trim the collection of patterns. So do these diverse criteria corrob-
orate or contradict each other?
Note that if pattern q1 is nonmaximal w.r.t. q2 , then

prq1 > prq2 ,

and
kq1 = kq2 ,
where kq1 is the observed frequency of q1 and kq2 is that of q2 . Further let,

    pr_{q_1}, pr_{q_2} < \frac{1}{2}.
Then the z-scores of the two patterns satisfy the following [ACP05]:

    z(q_2) ≥ z(q_1).

Thus, it is reassuring to learn that the observed frequency of occurrence be-


ing equal, a maximal motif always has a smaller probability than the non-
maximal version, hence its degree of surprise (overrepresentation) is only
stronger [ACP05, ACP07].
Thus, roughly speaking, a pattern that is combinatorially ‘uninteresting’ is
so statistically as well.

Rank z-score Motif

1 7,60E+07 RA.T[LV].C.P-(2,3)G.HP....AC[ATD].L....[ASG]

2 21416,8 A..[LV].C.P-(2,3)G.HP-(1,2,4)[ASG].[ATD]

3 8105,33 A-(1,4)T....P-(2,3)G.HP....[ATD]-(3)L....[ASG]

4 5841,85 [ATD].T....P-(1,2,3)G.HP-(1,2,4)A.[ATD]

5 4707,62 P.[ASG]-(2,3,4)P....AC[ATD].L....[ASG]

6 4409,21 A..[LV]...P-(2,3)G.HP-(1,2,4)A.[ATD]

7 3086,17 P-(1,2,3)[ASG]..P-(4)AC[ATD].L....[ASG]

8 3068,18 R..[ATD]....P-(2,3)G.HP-(1,2,4)[ASG].[ATD]

9 2615,98 [ASG][ATD]-(1,3,4)P....AC[ATD].L....[ASG]

10 2569,66 [ASG]-(1,2,3,4)P....AC[ATD].L....[ASG]

11 2145,6 G-(2,3)P....AC[ATD].L....[ASG]

FIGURE 7.1: The functionally relevant motif is shown in bold for Strep-
tomyces subtilisin-type inhibitors signature (id PS00999). Here 20 sequences
of about 2500 bases were analyzed.

7.7 Applications
We conclude the chapter by showing some results on protein and DNA se-
quences obtained by using the ideas in the chapter. The experiments 1 involve
automatic extraction of significant extensible patterns from some suitable col-
lection of sequences. The interested reader is directed to [ACP07] for further
details.

1 The experiments use a system called Varun which is accessible at:
www.research.ibm.com/computationalgenomics.

Rank z-score Motif

1 295840 [LIM]-(1,2,3,4)[STA][FY]DPC[LIM][ASG]C[ASG].H

2 2,86E+05 [LIM]-(1,2,3,4)[ASG][FY]DPC[LIM][ASG]C[ASG].H

3 155736 R-(1,4)[FY]DPC[LIM][ASG]C[ASG].H

4 78829 [LIM]-(1,2,3,4)[STA].DPC[LIM][ASG]C[ASG].H

5 76101,9 [LIM]-(1,2,3,4)[ASG].DPC[LIM][ASG]C[ASG].H

6 34205,6 [STA]-(1,4)DPC[LIM][ASG]C[ASG].H

7 30325,1 [LIM]-(1,2,3,4)[STA][FY]D.C[LIM][ASG]C..H

8 29276 [LIM]-(1,2,3,4)[ASG][FY]D.C[LIM][ASG]C..H

9 20527,3 [ASG]-(1,4)DPC[LIM][ASG]C[ASG].H

10 17503,4 [LIM]-(1,2,3,4)[ASG]..PC[LIM][ASG]C[ASG].H

FIGURE 7.2: The functionally relevant motifs are shown in bold for Nickel-
dependent hydrogenases (id PS00508). Here 22 sequences of about 23,000
bases were analyzed.

On protein data. Streptomyces subtilisin-type inhibitors (id PS00999): Bac-


teria of the Streptomyces family produce a family of proteinase inhibitors
characterized by their strong activity toward subtilisin. They are collectively
known as SSIs: Streptomyces Subtilisin Inhibitors. Some SSIs also inhibit
trypsin or chymotrypsin. In their mature secreted form, SSIs are proteins of
about 110 residues [TKT+ 94]. The functionally significant motif is discovered
as the top ranking one out of 470 extensible motifs (Figure 7.1).
Nickel-dependent hydrogenases (id PS00508): These are enzymes that catalyze
the reversible activation of hydrogen, are further involved in the binding of
nickel, and occur widely in prokaryotes as well as in some eukaryotes. There
are various types of hydrogenases, but all of them seem to contain at least
one iron-sulfur cluster. They can be broadly divided into two groups:
hydrogenases containing nickel and, in some cases, also selenium (the [NiFe]
and [NiFeSe] hydrogenases) and those lacking nickel (the [Fe] hydrogenases).
The [NiFe] and [NiFeSe] hydrogenases are heterodimers that consist of a small
subunit that contains a signal peptide and a large subunit [VCP+ 95]. All the
known large subunits seem to be evolutionary related; they contain two cys-
teine motifs; one at their N-terminal end; the other at their C-terminal end.
These four cysteines are involved in the binding of nickel. In the [NiFeSe] hy-
drogenases the first cysteine of the C-terminal motif is a selenocysteine which
has experimentally been shown to be a nickel ligand. Again, this functionally
significant motif is detected in the top three out of 4150 extensible motifs
(Figure 7.2).

Rank z-score Motif

1 24,3356 TTTGCTCA

2 16,1829 AAAAATGT

3 16,1829 AACTTAAA

4 16,1829 AAATCATG

5 16,0438 TTTGCTC

6 11,9715 ATAAAAA

7 11,9715 AAAAATG

8 11,9715 ACTTAAA

FIGURE 7.3: Motifs extracted from DNA sequences of the transcriptional


factor: CuRE.

DNA sequences The system automatically discovers the published motifs


for CuRE and UASGABA data in the top positions as shown in Figures 7.3
and 7.4.

7.8 Exercises
Exercise 69 (Cells) Consider the discovery method discussed in Section 7.2.
Let d be the density parameter and the alphabet is Σ.
1. Construct an example s to show that it is possible that a cell p occurs
k times in s but a maximal pattern p′ where p is a substring of p′ may
occur more than k times in s.
2. If B is the collection of all cells in the input at the initialization phase,
show that
|B| ≤ (2 + d) |Σ|2 .
3. Prove that if the cells are processed in the order of saturation, the max-
imal motifs are emitted before their nonmaximal versions.
Hint: 1. Let d = 3, then for each pair, say, A, C ∈ Σ, the possible d + 2 cells
are:
A C, A . C, A . . C, A . . . C, A − C.

Rank z-score Motif

1 8469,49 G.CAAAA.CCGC.GGCGG.A.T

2 1056,48 A.CGC.GCTT.G.AC.G.AA

3 528,79 GG.A.TC.T.T.G.TA.T.GC

4 527,143 TT.GA.ATG.TTT.T.TC

5 263,566 GT.CG.T.AT.G.ATA.G

6 263,293 TT.TC.T.C.CC.AAAA

7 263,293 GAT.ATA.AA.A.AG.A

8 263,293 CA.A.TA.TCA.TT.CT

9 263,293 T.TA.G.T.TTT.CTTC

10 263,022 T.ATA.T.TATTAT.A

11 131,499 ATA.A.AA.AG.A.AA

12 131,499 T.TTT.CTT.T.CC.A

13 131,364 G.TGT.AT.AT.TAA

14 131,229 C.T.AATAA.AAAT

15 131,229 TAT.G.TAATC.CT

FIGURE 7.4: Motifs extracted from DNA sequences of the transcriptional


factor: UASGABA.

3. Let density constraint d = 2. The extensible pattern p and its two occur-
rences. Cell, C T, occurs only once in the input s.

s=AGAGCT
p=AG−CT
AG AGCT
AG AG CT

Exercise 70 (Discovery algorithm) Consider the discovery method dis-


cussed in Section 7.2.

1. What is the running time complexity of the algorithm? Suggest some


heuristics to improve the running time.

2. Generalize the method to handle sequences on multi-sets.

Hint: 1. If in the processing of a cell, all its locations have been used, can it
be removed from the B set?

Exercise 71 (Generalized suffix tree) How is the ‘suffix’ tree shown be-
low related to the algorithm discussed in Section 7.2. Note that the labels in
the edges of the tree use the ‘.’ and ‘-’ symbols. The density parameter d = 2

and the quorum is k = 2.


[Tree figure: a generalized ‘suffix’ tree over the example input string, whose
edge labels use the ‘.’ and ‘-’ symbols; each leaf is annotated with the
starting positions (location list) of the corresponding occurrences.]
Hint: This is taken from [CP04]. Some steps in the construction of this tree
are shown below.

[Figure: intermediate stages (1)–(4) of the construction of the tree above,
showing how the ‘.’ and ‘-’ labeled edges and their location lists are added.]

Exercise 72 (Density constraints (d, D)) The density constraint d de-


notes the maximum number of gaps (dot characters) between any two con-
secutive solid characters in the pattern. Sometimes another constraint, D, is
used which gives the minimum length of each rigid segment (having both solid
and dot characters) in the pattern. Thus a segment between two consecutive
extensible gap (‘−’) characters is constrained to be of length D.
Modify the algorithm of Section 7.2 to include this D constraint.
1. Incorporate the change at the iteration phase.
2. Incorporate the change at the initialization phase.
Hint: 2. Is it possible to first build only the rigid segments (using rigid cells)
thus enlarging the set of cells B. Note that there can be no pattern of the
form, say,
A C . − G,
where the dot character is adjacent to the ‘−’ character.

Exercise 73 (Maximality check) In the algorithm of Section 7.2, a pattern


p1 is checked for maximality against, say some p2 , before it is emitted. Let

|Lp1 | = |Lp2 |.

Devise a method to check if extensible pattern p1 is nonmaximal w.r.t. p2 ,


without checking the location lists Lp1 and Lp2 .
Hint: For rigid patterns p1 and p2 , p1 needs to be aligned at some position j
from the left. For extensible motifs, check for solid elements that are in one
but not in the other.

For the problems below, see Section 5.2.4 for a definition of random
strings.

Exercise 74 Let
p = G A A T T C.
1. What are the odds of seeing p at least 10 times in a 600-base-long random
DNA fragment?
2. What is the expected copy number of p in a random DNA fragment
of length n (i.e., how many times do you expect to see p in the DNA
fragment)?

Exercise 75 Assume that the chance of occurrence of purines (adenine or


guanine) and pyrimidines (thymine or cytosine) in a selected strand of DNA
is the same. Then, is the chance of seeing
at least 7 purines in a random DNA strand of 10 bases
the same as seeing
at least 14 in a random strand of 20 bases?
Why?
Hint: Yes, it is a trick question.

Exercise 76 Assume a random DNA sequence s of length n. Let prz denote


the probability of occurrence of nucleotide

    z ∈ {A, C, G, T}

in s. Let
prA = prC = prG = prT = 0.25.
1. What is the probability of seeing a pattern p of length l?
2. How do the odds change if p must occur at least k > 1 times?
3. What can you conclude if n ≫ l?
What are the answers to the three questions under the assumption:

prA = prT = 0.2


prC = prG = 0.3.

Exercise 77 In a random string on

{A, C, G, T}

(note each character occurs with equal probability), what is the longest stretch
of A’s you expect to see?

Exercise 78 (Restriction enzyme) A position specific restriction enzyme


cleaves double stranded DNA at restriction sites characterized solely by the
base composition. Thus a DNA sequence can be broken up into fragments.
Assuming the restriction site to be the pattern

G A A T T C,

what is the average length of a fragment? Assume that each base occurs at a
position with equal probability and independently.

Exercise 79 What is the probability of occurrence of a 10-mer consisting of


1. at least two A’s,

2. at least one C,

3. at least one G and

4. at least two T’s

in a random DNA sequence?

Hint: See Section 11.2.2.

Exercise 80 (Restriction site) Let the probability p of an occurrence of a


mutation (or a restriction site) be

    1/1024.
What is the average distance between two mutations (or restriction sites)?
What is the standard deviation?

Hint: The probability distribution of the number k of Bernoulli trials needed


to get one success, defined on the domain

Ω = {1, 2, 3, . . .}

is called a geometric distribution. The probability mass function is given by

pr(k) = (1 − p)k−1 p.

A random variable following this distribution is usually denoted as

    X ∼ Geometric(p).

Show that
    E[X] = \frac{1}{p},
    V[X] = \frac{1 − p}{p^2}.
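A quick empirical check of these two identities, drawing geometric samples by inverse-transform sampling (the sampler below is an illustration, not required by the exercise):

import math
import random

def geometric_sample(p, rng):
    """One draw from the geometric distribution (number of Bernoulli trials
    up to and including the first success), by inverse-transform sampling."""
    u = 1.0 - rng.random()                     # u in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1

def check_geometric(p, trials=100_000, seed=7):
    rng = random.Random(seed)
    xs = [geometric_sample(p, rng) for _ in range(trials)]
    mean = sum(xs) / trials
    var = sum((x - mean) ** 2 for x in xs) / trials
    return mean, var

p = 1 / 1024
mean, var = check_geometric(p)
print(mean, 1 / p)              # empirical mean vs. E[X] = 1/p
print(var, (1 - p) / p ** 2)    # empirical variance vs. V[X] = (1-p)/p^2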

Exercise 81 (Pitfall of extensibility) Let q be an extensible pattern on


some input with the set of realizations,

R(q) = {q1 , q2 , . . . , ql },

where l > 1. Since at any location in the input one of R(q) occurs, the
probability of occurrences of q is the sum of the probability of occurrence of
the rigid motif
qj ∈ R(q), 1 ≤ j ≤ l.
Thus the probability of occurrence of q, prq , is given as
    pr_q = \sum_{q_j ∈ R(q)} pr_{q_j}.

What is wrong with this computation of prq ?


Hint: Consider the following example. Let Σ = {A, C}, with

prA = prC = 0.5

Let extensible pattern


q=A−C
and the realizations of q be

R(q) = {A C, A . C, A . . C, A . . . C, A . . . . C}.

Then

    pr_{A C} = pr_{A . C} = pr_{A . . C} = pr_{A . . . C} = pr_{A . . . . C} = 0.25.

It follows that

    pr_q = 5 × 0.25 = 1.25 > 1.

Exercise 82 Let
Σ = {A, C, F, G, L, V}
and the probability of occurrence of σ ∈ Σ is prσ with
    \sum_{σ∈Σ} pr_σ = 1.

Let
q1 = A C . . L

1. Compute the probability of occurrence of q1 or q2 in a random string


where
q2 = G V L F F

2. Compute the probability of occurrence of q1 or q2 in a random string


where
q2 = A . L F

Hint: 2. q1 and q2 may co-occur; use Inclusion-Exclusion Principle.

Exercise 83 1. Let pattern q1 be nonmaximal w.r.t. q2 . Show that under


i.i.d. distribution,
prq2 < prq1 .

2. Let pattern q1 be redundant w.r.t. q2 . Show that under i.i.d. distribu-


tion,
prq2 < prq1 .
Do the same results hold under a Markov model?

Exercise 84 (Alternative definition of maximality) Recall the maxi-


mality definition discussed in the chapter: Given s, q1 is nonmaximal w.r.t.
q2 if and only if,

q1 ≺ q2 and
L q1 = L q2 .

Consider an alternative definition of maximality: Given s, pattern q1 is non-


maximal w.r.t. q2 if and only if

q1 ≺ q2 .

1. Let
s1 = A C G T A C G T C G T G T.
Enumerate the maximal solid patterns in s1 for each of the maximal
definitions.
2. Let
s2 = A C G T A C G T.
Enumerate the maximal solid patterns in s2 for each of the maximal
definitions.
3. Then is it true that the z-scores satisfy z(q1 ) ≥ z(q2 )?
4. Compare the two definitions of maximality.

Hint: Definition 1:

Pmaximal (s1 ) = {A C G T, C G T, G T},


Pmaximal (s2 ) = {A C G T}.

Definition 2:

Pmaximal (s1 ) = Pmaximal (s2 )


= {A C G T}.

3. Note that, under this definition, it is possible that

|Lq1 | > |Lq2 |.

4. How do the two z-scores compare under each of the two schemes?


Chapter 8
Motif Learning

The science of learning


is indeed a fine art.
- anonymous

8.1 Introduction: Local Multiple Alignment

We ask a basic question: What is a motif in a biological sequence? One


possible meaningful definition is to look for structural or functional implica-
tions of a segment and if this can be (unambiguously) associated with the
segment, then the segment qualifies to be a motif. To put it another way, a
segment shared by multiple protein or nucleic acid sequences is a motif as it
could possibly tell us about evolution, structure or even function.
Using this premise, in the absence of supporting information such as three-
dimensional structures or details of chemical interactions of residues or effects
of mutations or even function, is it possible to identify segments as ‘potential’
motifs? The recognition of these segments then relies on patterns shared by
multiple protein or nucleic acid sequences. To elicit these shared regions, the
most appropriate approach is to align the sequences. The alignment could be
global, where the task is to align the entire sequences as best as possible or it
could be local where the focus is on shorter segments of the sequences. These
short segments qualify as motifs. In other words motifs can also be viewed
as the consensus derived from a local multiple alignment of sequences. This
process is also called learning motifs from the sequences.
This is complicated by the fact that the target motif (or sometimes also
called a signal) may vary greatly among different sequences (say proteins).
The challenge is to discover these subtle motifs. An example of such a motif is
shown in Figure 8.1 which is taken from [LAB+ 93]. It shows repeating motifs
in prenyltransferase subunits: Ram1 (Swiss-Prot accession number P22007)
and FT-β (Swiss-Prot Q02293) are the β subunits of farnesyltransferase from
the yeast Saccharomyces cerevisiae and rat brain respectively. Cdc43 (Swiss-
Prot P18898) is the β subunit of type II geranylgeranyltransferase from the


129 --- GPFGGGPGQLSH LA-


181 --- GGFKTCLEVGEV DTR
230 --- GGFGSCPHVDEA HGG
Ram1
279 --- RGFCGRSNKLVD GC-
331 --- PGLRDKPQAHSD FY-
SSYSCTPNDSPH

122 --- GGFGGGPGQYPH LA


173 --- GSFLMHVGGEVD VR
FT-β 221 --- GGIGGVPGMEAH GG
269 --- GGFQGRCNKLVD GC
331 --- GGLLDKPGKSRD FY

191 --- GAFGAHNEPHSG --


Cdc43 240 SDD GGFQGRENKFAD TC
309 --- GGFSKNDEEDAD LY

FIGURE 8.1: An example of sequences aligned by the motif that is shown


boxed. Notice the extent of dissimilarity of the motif across its occurrences.

rat brain.1

8.2 Probabilistic Model: Motif Profile


Recall that a motif is simply a sequence of characters from the alphabet,
possibly interspersed with dont care characters. For example a motif

GKK.DD

is interpreted as a segment whose

1. first element is always G (glycine),

2. second and third elements are always K (lysine),

3. fifth and sixth elements are always D (aspartic acid) and

4. the fourth element could potentially be any residue.

1 See the cited paper for any further details on this example.

The emphasis here is on ‘always’. Clearly, the fourth element violates the
‘always’ criterion, giving a ‘not always’ condition, and hence is relegated to a
dont care. 2 So, can we find a middle ground between ‘always’ and dont care?
A probabilistic model of a motif associates a real number between zero and
one (a probability) with each element (residue) that may occur at each position
of the motif. This is defined formally below.
Consider an alphabet of size L = |Σ| as follows.

Σ = {σ1 , σ2 , . . . , σL }.

A motif of size l is then defined by an |Σ| × l matrix ρ:

    ρ = \begin{pmatrix}
          ρ_{11} & ρ_{12} & \cdots & ρ_{1l} \\
          ρ_{21} & ρ_{22} & \cdots & ρ_{2l} \\
          ρ_{31} & ρ_{32} & \cdots & ρ_{3l} \\
          \vdots & \vdots &        & \vdots \\
          ρ_{L1} & ρ_{L2} & \cdots & ρ_{Ll}
        \end{pmatrix}.        (8.1)

The rough interpretation of ρ_{rc} is that it is the probability that position c in
the motif is σ_r. Thus, for each 1 ≤ j ≤ l,

    \sum_{i=1}^{L} ρ_{ij} = 1.
ρij = 1.
1=1

In other words, each column of matrix ρ adds up to 1. This matrix ρ is also


called the probability matrix.
For ease of exposition, in the rest of the chapter, the subscript r in ρ_{rc} may
also be written as
    ρ_{σ_r c}.
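A minimal sketch of such a probability matrix (the numeric values are arbitrary), together with the column-sum check:

# Columns are motif positions; rows follow the order A, C, G, T.
PROFILE = {
    "A": [0.6, 0.3, 0.05],
    "C": [0.2, 0.1, 0.05],
    "G": [0.1, 0.5, 0.10],
    "T": [0.1, 0.1, 0.80],
}

def check_profile(rho):
    """Every column of a probability matrix must add up to 1."""
    l = len(next(iter(rho.values())))
    for c in range(l):
        col_sum = sum(rho[sigma][c] for sigma in rho)
        assert abs(col_sum - 1.0) < 1e-9, f"column {c} sums to {col_sum}"

check_profile(PROFILE)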

8.3 The Learning Problem


The central task is defined as follows: Given t sequences si, each of length ni,
defined on some finite alphabet Σ, and a motif length l, the task is to find a
motif of length l that is shared by all the t sequences.
In the most general scenario:

1. the motif may or may not occur in a sequence and

2. when it does, the motif may occur multiple times in a sequence.

2 In fact this discontinuity gets in the way of elegant formalization under some combinatorial

models.

On occurrences. If the motif were to occur at most once in each sequence,


it is usually called unsupervised learning. This means that a motif may or
may not occur in each sequence.
If the motif occurs at least once in each sequence, the discovery of the motif
under such a condition is usually called supervised learning.
We will consider the very special case when the motif occurs exactly once
in each sequence in the following discussion.
The million dollar question is: when does ρ deserve the dignity of a motif?
In principle, any alignment of a segment of length l of the t sequences can
potentially produce a probability matrix ρ. We address this issue by insisting
that the motif discovered is the best amongst all such l length motifs. This
has two implications:

1. A quantitative measure, F , must be defined

F : R → R,

where R is the set of all motif profiles ρ (of the same length l). In other
words, for a given collection of input sequences,

F (ρ) = v(∈ R).

We discuss possible forms of F in the following section.

2. The problem will produce only one result for a fixed l.

8.4 Importance Measure


Given two motif profiles on an input data set, how can they be compared?
In the absence of any other supporting information, we discuss below two
ways of comparing motifs. The first uses a log likelihood measure and the
second uses information content.

8.4.1 Statistical significance


The task is to assign a numerical value that can be associated with the
(statistical) significance of the motif. The

log(likelihood)

is such a measure of the profile. The higher the value, usually the more
(statistically) significant the motif profile.
The positions in the input data is divided into

1. motif positions and

2. nonmotif (or background) positions.

Further, each motif position has an associated column c depending on what


position in the motif covers this position in the input. Thus each position is
annotated as follows


    C_{ij} = \begin{cases}
        0 & s_{ij} \text{ is a nonmotif position,} \\
        1 & s_{ij} \text{ is position 1 of the motif,} \\
        2 & s_{ij} \text{ is position 2 of the motif,} \\
        \vdots & \\
        c & s_{ij} \text{ is position } c \text{ of the motif,} \\
        \vdots & \\
        l & s_{ij} \text{ is position } l \text{ of the motif.}
    \end{cases}

Define ρr0 as the probability of character

σr ∈ Σ

in all nonmotif positions. Thus column 0 of the matrix describes the ‘back-
ground’.
For ease of exposition, the row r in the matrix ρ will be replaced by the
character it represents, i.e., σr . Also, we will switch between notation

ρij

and
ρ[i, j],
depending on the context, for ease of understanding. Note that we use column
0 in the ρ matrix to denote the nonmotif or background probabilities of each
character in the alphabet. Then, given the input and the matrix ρ for rows

1 ≤ r ≤ |Σ|

and columns
0 ≤ c ≤ l,
the log of the likelihood is given as
  
    F_1 = \log \left( \prod_{i=1}^{t} \prod_{j=1}^{n_i} ρ[s_{ij}, C_{ij}] \right)
        = \sum_{i=1}^{t} \left( \sum_{j=1}^{n_i} \log(ρ[s_{ij}, C_{ij}]) \right).

Let f be a |Σ| × (l + 1) matrix and each entry fσc denotes the number of
positions (given by i and j) in the input with annotation c, i.e.,

Cij = c,

and value σ, i.e.,


sij = σ.

Then
    F_1 = \sum_{σ∈Σ} \sum_{c=0}^{l} f_{σc} \log(ρ_{σc}).

Yet another effective measure is obtained by taking a ratio with the background
probabilities as follows:

    F_2 = \sum_{σ∈Σ} \sum_{c=1}^{l} f_{σc} \log \left( \frac{ρ_{σc}}{ρ_{σ0}} \right).

Thus F1 (and also F2 ) can be computed given

1. the |Σ| × (l + 1) probability matrix ρ and

2. the |Σ| × (l + 1) frequency matrix f .

Consider a concrete example with the solution (alignment) shown below.

Alignment
s1 = A [A C C] T A
s2 = [A T G] T A G G
s3 = A T [A C T] A
consensus: A C G

Note that this is a very small example and statistical methods work well
for larger data sets. However we use this to explain the formula. Let the
probability matrix ρ be given as follows. Then using the alignment and matrix
ρ, the frequency matrix f can be constructed as follows.
   
        c:    0     1     2     3                   c:  0  1  2  3
    ρ = ( 0.40  0.90  0.10  0.09 )  A          f = ( 5  3  0  0 )  A
        ( 0.01  0.04  0.80  0.10 )  C              ( 0  0  2  1 )  C
        ( 0.40  0.04  0.09  0.80 )  G              ( 2  0  0  1 )  G
        ( 0.19  0.02  0.01  0.01 )  T              ( 3  0  1  1 )  T

Then the two measures are:


    F_1 = \sum_{σ∈Σ} \sum_{c=0}^{l} f_{σc} \log(ρ_{σc})
        = 5 \log 0.4 + 2 \log 0.4 + 3 \log 0.19
          + 3 \log 0.9
          + 2 \log 0.8 + \log 0.01
          + \log 0.1 + \log 0.8 + \log 0.01
        = −10.3773

    F_2 = \sum_{σ∈Σ} \sum_{c=1}^{l} f_{σc} \log \left( \frac{ρ_{σc}}{ρ_{σ0}} \right)
        = 3 \log \left( \frac{0.9}{0.4} \right)
          + 2 \log \left( \frac{0.8}{0.01} \right) + \log \left( \frac{0.01}{0.19} \right)
          + \log \left( \frac{0.1}{0.01} \right) + \log \left( \frac{0.8}{0.4} \right) + \log \left( \frac{0.01}{0.19} \right)
        = 3.60625

Note that F1 is almost independent of the number of sequences t and the


length of each sequence ni . Also, this value always improves (increases) with
increasing motif length, as well as the input size. This is the well-recognized
issue of model selection in statistics. One of the effective ways of dealing with
this is to use the following ‘normalization’ for F1:

    \sum_{i=1}^{t} \left( \log \frac{1}{n_i − l + 1} + \sum_{σ∈Σ} \sum_{c=1}^{l} f_{σc} \log(ρ_{σc}) + (n_i − l) \sum_{σ∈Σ} f_{σ0} \log(ρ_{σ0}) \right).
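The two measures are easy to compute from the ρ and f matrices of the worked example; base-10 logarithms are assumed below, since they reproduce the reported values −10.3773 and 3.60625.

import math

# rho and f from the worked example; column 0 is the background.
RHO = {"A": [0.40, 0.90, 0.10, 0.09], "C": [0.01, 0.04, 0.80, 0.10],
       "G": [0.40, 0.04, 0.09, 0.80], "T": [0.19, 0.02, 0.01, 0.01]}
F = {"A": [5, 3, 0, 0], "C": [0, 0, 2, 1],
     "G": [2, 0, 0, 1], "T": [3, 0, 1, 1]}

def f1(rho, f):
    """F1 = sum over sigma and c >= 0 of f[sigma][c] * log(rho[sigma][c])."""
    return sum(f[s][c] * math.log10(rho[s][c])
               for s in rho for c in range(len(rho[s])) if f[s][c] > 0)

def f2(rho, f):
    """F2 uses only motif columns (c >= 1) and a ratio to the background column."""
    return sum(f[s][c] * math.log10(rho[s][c] / rho[s][0])
               for s in rho for c in range(1, len(rho[s])) if f[s][c] > 0)

print(round(f1(RHO, F), 4))   # about -10.3773
print(round(f2(RHO, F), 5))   # about 3.60625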

8.4.2 Information content


The relationship between the motif profile and information content of the
motif (with respect to the input) was made by Stormo [Sto88]. For a quick
introduction to information theory, we direct the reader to Exercise 21 in
Chapter 3.
Let s denote the input (which is a collection of t sequences). The informa-
tion content of each position c in the motif profile is defined as
 
    I_c(s) = \sum_{σ∈Σ} ρ_{σc} \log \left( \frac{ρ_{σc}}{f_σ} \right),

where fσ is the number of times σ appears in input s. Thus the overall



information content of the profile is


    I_ρ(s) = \sum_{c=1}^{l} I_c(s)
           = \sum_{σ∈Σ} \sum_{c=1}^{l} ρ_{σc} \log \left( \frac{ρ_{σc}}{f_σ} \right).

The information content is yet another measure to compare different motif


profiles. The higher the value of Iρ (s), usually the more significant is the
motif profile.
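A sketch of this computation; as an assumption made here, the reference distribution is taken from the nonmotif (background) character counts, with add-one pseudocounts so that no background frequency is zero.

import math

def information_content(rho, background_counts):
    """Relative entropy of each motif column against the background character
    distribution, summed over the columns (higher is more significant).

    rho               : dict mapping each character to its motif-column
                        probabilities (columns 1..l only)
    background_counts : dict mapping each character to the number of times it
                        appears in the nonmotif positions of the input
    """
    # Add-one pseudocounts (an assumption; see the pseudocount discussion in 8.7.1).
    total = sum(background_counts.values()) + len(background_counts)
    score = 0.0
    l = len(next(iter(rho.values())))
    for c in range(l):
        for sigma, probs in rho.items():
            p = probs[c]
            q = (background_counts[sigma] + 1) / total
            if p > 0:
                score += p * math.log10(p / q)
    return score

rho = {"A": [0.9, 0.1, 0.09], "C": [0.04, 0.8, 0.1],
       "G": [0.04, 0.09, 0.8], "T": [0.02, 0.01, 0.01]}
print(information_content(rho, {"A": 5, "C": 0, "G": 2, "T": 3}))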

8.5 Algorithms to Learn a Motif Profile


Recall that in this chapter we discuss the learning problem where the motif
occurs exactly once in every input sequence. Let ρ be the motif profile sought.
The motif represented by the profile ρ occurs in all the t sequences suggest-
ing an alignment of l positions in each sequence. The remaining
ni − l
positions in each sequence are often called the nonmotif region or the back-
ground. Let Z denote this occurrence (or alignment) information. The detail
of what Z represents will depend on the particular method in use.
The motif length l and the t input sequences are given. We make the
following assumption.
1. Given ρ, the occurrence/alignment information Z can be estimated.
2. Given Z, motif profile ρ can be estimated.
Then, intuitively, there are two possible iterative schemes to solve the learning
problem as shown in Figure 8.2. We begin with an initial estimate and improve
the result over the iterations. The iterations terminate when the difference
across the iterations is below some δ, that is considered insignificant. In fact,
the actual value of δ depends on the problem domain.
It is not too difficult to see that the method is correct. Does the solution
improve during the course of the iterations? We leave this as an exercise for
the reader (see Exercise 88).

Running time complexity. Let the size of the input be given as


    N_I = \sum_{i=1}^{t} n_i.

Method 1: Method 2:
1. Initialize ρ0 1. Initialize Z0
2. Repeat 2. Repeat
(a) Re-estimate Z from ρ (a) Re-estimate ρ from Z
(b) Re-estimate ρ from Z (b) Re-estimate Z from ρ
3. Until change in ρ is small 3. Until change in Z is small
FIGURE 8.2: Given the input sequences and the motif length, two possible
learning methods to learn a motif profile.

In each of the methods, each iteration (Steps 2-3) takes time


O (l|Σ|NI ) .
Assuming l and |Σ| to be constants, each iteration takes only linear time. The
number of iterations depends on the data and in practice this is believed to be
small.

Issues with a learning method. It is important to note some of the issues


with a learning method such as the one described above.
1. (pattern length) The length l of the motif (profile) is a fixed parameter.
Thus to find motifs of different lengths, the algorithm must be run
multiple times.
2. (one at a time) The method finds only one motif at a time. Once the
motif is found, the input must be modified to remove the occurrence of
this motif, for subsequent searches for other motifs.
3. What is the best initial estimate?
The solution offered by the method depends on the initial estimate.
Ideally, all possible initial estimates should be tried, but this is not
a practical option.
4. (phase shift) If p[1..l] is the true signal, a learning method may converge
to a subpattern of p, i.e., starting from say the second or third position.
Hence in practice, after the pattern has been found, some further inves-
tigation is done to the left of the pattern to make sure that a subpattern
is not being reported as the pattern.
If such nontrivial issues are associated with discovering profiles, then why use
profiles at all?
One major appealing feature is that it allows some flexibility in the pattern
description. Yet another attractive feature is that it provides a (statistical)
significance value inherently associated with a motif.

8.6 An Expectation Maximization Framework


In this section we place Method 1, described above, in an expectation max-
imization framework [BE94]. In fact Lawrence and Reilly [LR90] introduced
the expectation maximization approach as a means for extracting a motif
profile from a given data set.
Under this model, we let Z be represented by z and z is the matrix of
offset probabilities where zij is the probability that the shared motif starts in
position j of sequence i.
In the following sections, we will first describe methods to

1. estimate z, given ρ and

2. estimate ρ, given z.

We begin by discussing the initial estimate used in the algorithm.

8.6.1 The initial estimate ρ0


The method iteratively converges to a profile ρ. However, this depends on
the initial estimate ρ0. Thus different initial estimates, for the same data set
and the same length l of the motif, could discover different motif profiles.
In practice, the initial estimate is taken from a substring in the data. For
example, let the estimate be taken from position j = 4 for a motif profile of
length l = 3 from s1 where

s1 = A G G C T T A G C T G.

The most obvious profile model for this appears to be


 
         ( 0.0  0.0  0.0 )  A
    ρ0 = ( 1.0  0.0  0.0 )  C        (8.2)
         ( 0.0  0.0  0.0 )  G
         ( 0.0  1.0  1.0 )  T

However this motif profile will never change over the iterations. The proof is
straightforward and we leave that as an exercise for the reader (Exercise 87).
Thus, in practice, no entry of the ρ0 matrix is set to 0.0. However, since
one entry in the column is biased towards one character, the following is a
good initial estimate for the example above.
 
         ( 0.15  0.15  0.15 )  A
    ρ0 = ( 0.55  0.15  0.15 )  C
         ( 0.15  0.15  0.15 )  G
         ( 0.15  0.55  0.55 )  T

Thus the following is a good rule of thumb to initialize a column c to some


x ∈ Σ. Let L = |Σ|. Then column c is initialized to the following.
 
    column c = ( 0.5/(L−1) )  σ_1
               ( 0.5/(L−1) )  σ_2
               (    ...    )
               (    0.5    )  x
               (    ...    )
               ( 0.5/(L−1) )  σ_L

Also note that for this column c,

    \sum_{r=1}^{L} ρ_{rc} = (L − 1) \frac{0.5}{L − 1} + 0.5 = 1.0.
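The rule of thumb is easy to code; the parameter names below (strong, weak) are choices made here, and the default 0.55 reproduces the DNA example above.

def initial_profile(seed, alphabet="ACGT", strong=0.55):
    """Rule-of-thumb initial estimate rho0 built from a seed substring: the seed
    character of each column gets the weight `strong`, and the remaining
    probability mass is spread evenly over the other characters."""
    weak = (1.0 - strong) / (len(alphabet) - 1)
    return {sigma: [strong if sigma == x else weak for x in seed]
            for sigma in alphabet}

# Seed taken from position j = 4 (1-based), length l = 3, of
# s1 = AGGCTTAGCTG, i.e. the substring "CTT":
rho0 = initial_profile("CTT")
for sigma in "ACGT":
    print(sigma, rho0[sigma])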

8.6.2 Estimating z given ρ


Under a combinatorial model, we have the occurrence at position j of se-
quence i to be either a ‘yes’ or a ‘no’. Under a motif profile model, the
occurrence is the probability zij , thus

0 ≤ zij ≤ 1,

for i and j. This is best explained through a concrete example.


Let sij denote the jth character of the ith input sequence. For ease of
exposition, we let the row r of the matrix ρ to directly denote the character
it represents (A, C, G or T for instance). The probability of occurrence of a
given ρ at position j in sequence si is given by
    z′_{ij} = \prod_{c=1}^{l} ρ_{σ_c c},

where
si(j+c−1) = σc .
In other words the sequence si has the character σc at position j + c − 1. This
is a straightforward interpretation of the motif profile. For example, let the
motif profile where l = 5 be given as
 
        ( 0.6  0.3  0.05  0.7  0.1 )  A
    ρ = ( 0.2  0.1  0.05  0.1  0.6 )  C
        ( 0.1  0.5  0.10  0.1  0.1 )  G
        ( 0.1  0.1  0.80  0.1  0.2 )  T

Let j = 4 and the sequence s1 be

s1 = A G G C T T A G C T G.

Then for position j = 4 of sequence s1 ,


    z′_{14} = \prod_{c=1}^{5} ρ_{σ_c c}
            = ρ[C, 1] · ρ[T, 2] · ρ[T, 3] · ρ[A, 4] · ρ[G, 5]
            = (0.2)(0.1)(0.8)(0.7)(0.1)
            = 0.00112

Since this is a probability, we normalize this by summing over all the values
in sequence si . Thus, for each i,

    z_{ij} = \frac{z′_{ij}}{\sum_{j=1}^{n_i−l+1} z′_{ij}}.        (8.3)
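A sketch of this estimation step (0-based indexing is used here); the raw window score for position j = 4 of s1 equals the product computed above, and the returned values are the normalized z_{1j} of Equation (8.3).

def estimate_z(seq, rho, l):
    """Score every start position by the product of profile entries over the
    l-window, then normalize the scores to sum to 1 across the sequence."""
    raw = []
    for j in range(len(seq) - l + 1):
        score = 1.0
        for c in range(l):
            score *= rho[seq[j + c]][c]
        raw.append(score)
    total = sum(raw)
    return [x / total for x in raw]

rho = {"A": [0.6, 0.3, 0.05, 0.7, 0.1], "C": [0.2, 0.1, 0.05, 0.1, 0.6],
       "G": [0.1, 0.5, 0.10, 0.1, 0.1], "T": [0.1, 0.1, 0.80, 0.1, 0.2]}
z1 = estimate_z("AGGCTTAGCTG", rho, 5)
print(z1)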

8.6.3 Estimating ρ given z


Consider the scenario where the occurrence probabilities, zij , for all possible
values of i and j are given. Note that j is a function of i and should ideally
be written as ji , but we omit the subscript to avoid clutter. i and j take the
following values:

1 ≤ i ≤ t and
1 ≤ j ≤ (ni − l + 1), for each i.

Also, since zij are probabilities, under the assumption that a motif occurs
exactly once per sequence,3 we assume the following for each i.
    \sum_{j=1}^{n_i−l+1} z_{ij} = 1.        (8.4)

For each sequence si , a position xi , 1 ≤ xi ≤ ni − l + 1, is chosen with


probability
    \frac{z_{i x_i}}{\sum_{j=1}^{n_i−l+1} z_{ij}}.

3 Under a more general assumption,


    \sum_{i=1}^{t} \left( \sum_{j=1}^{n_i−l+1} z_{ij} \right) = 1.

Thus the following t positions are chosen randomly:


x1 , x2 , . . . , xt
This suggests an l-wide alignment of the t sequences. A method for estimating
ρ from this alignment is discussed in Section 8.7.1.

Back to the method. We will now place Method 1 in the framework of


expectation maximization.
We use the following observation [DH73]: The likelihood of the profile given
the training data is the probability of the data given the profile. The training
data here refers to sequences that are aligned by the motif. Of course, we
need to find an alignment of the sequences.
Recall Bayes’ theorem (Theorem (3.1)):
    P(E_1 | E_2) = \frac{P(E_2 | E_1)}{P(E_2)} P(E_1),
where events E1 and E2 are in the same probability space and P (E2 ) > 0.
We begin by defining a convenient variable which keeps track of the occur-
rence of the motif (profile) at a position in the input. This position, for each
i is given as
1 ≤ j ≤ ni − l + 1.
Recall that ni is the length of the sequence si and l is the length of the motif.
A motif may or may not occur at this position j. Let Xij be an indicator
variable where

    X_{ij} = \begin{cases} 1 & \text{if the motif starts at position } j \text{ in } s_i, \\ 0 & \text{otherwise.} \end{cases}

Further, let ρ(q) denote the estimate of ρ and z (q) denote the estimate of z
after q iterations. Given ρ(q) , the probability of sequence si , given the start
position of the motif (profile), is
    P(s_i | X_{ij} = 1, ρ^{(q)}) = \prod_{c=1}^{l} ρ^{(q)}_{x_c c},

where
si(j+c−1) = xc .
Note that using our notation,
 
    z_{ij} = P(s_i | X_{ij} = 1, ρ^{(q)}).

Using Bayes’ theorem,


    P(X_{ij} = 1 | ρ^{(q)}, s_i) = \frac{P(s_i | X_{ij} = 1, ρ^{(q)}) \, P^0(X_{ij} = 1)}{\sum_{c=1}^{n_i−l+1} P(s_i | X_{ic} = 1, ρ^{(q)}) \, P^0(X_{ic} = 1)},        (8.5)

where P 0 (Xij = 1) is the prior probability that the motif starts at position j
in sequence si . Since no information is available about the occurrence of the
motif, P 0 is assumed to be uniform. Thus, for each 1 ≤ i ≤ t,
    P^0(X_{ij} = 1) = \frac{1}{n_i − l + 1}, \quad \text{for } 1 ≤ j ≤ n_i − l + 1.
Then the denominator in Equation (8.5) simplifies to
    \sum_{c=1}^{n_i−l+1} P(s_i | X_{ic} = 1, ρ^{(q)}) \, P^0(X_{ic} = 1)
        = \sum_{c=1}^{n_i−l+1} P(s_i | X_{ic} = 1, ρ^{(q)}) \, \frac{1}{n_i − l + 1}
        = \frac{1}{n_i − l + 1} \sum_{c=1}^{n_i−l+1} P(s_i | X_{ic} = 1, ρ^{(q)}).

Then Equation (8.5) is rewritten as

    z_{ij} = P(X_{ij} = 1 | ρ^{(q)}, s_i)
           = \frac{P(s_i | X_{ij} = 1, ρ^{(q)}) \, \frac{1}{n_i − l + 1}}{\frac{1}{n_i − l + 1} \sum_{c=1}^{n_i−l+1} P(s_i | X_{ic} = 1, ρ^{(q)})}.
For each i, the constant 1/(n_i − l + 1) above is fixed and cancels from the
numerator and the denominator, and

    \sum_{j=1}^{n_i−l+1} z_{ij} = 1,

hence there is no loss in using the following simplification:


    z_{ij} = \frac{P(s_i | X_{ij} = 1, ρ^{(q)})}{\sum_{c=1}^{n_i−l+1} P(s_i | X_{ic} = 1, ρ^{(q)})}.        (8.6)
Under this simplification,
    \sum_{j=1}^{n_i−l+1} z_{ij} = 1.

Notice that Equation (8.6) has the same form as Equation (8.3) of Sec-
tion 8.6.2.
Thus we have shown that Method 1 can be viewed as an expectation max-
imization strategy.
There has been a flurry of activity around this problem [EP02, KP02b].
For instance, Improbizer [AGK+ 04] also uses expectation maximization to
determine weight matrices of DNA motifs that occur improbably often in the
input data.

8.7 A Gibbs Sampling Strategy


In this section we use Method 2 of Section 8.5 through a Gibbs sampling
strategy [LAB+ 93].
Gibbs sampling is a strategy to generate a sequence of samples from the
joint probability distribution of two or more random variables. This is appli-
cable when the joint distribution is not known explicitly, but the conditional
distribution of each variable is known. The sampling method is to generate
an instance from the distribution of each variable in turn, conditional on the
fixed values of the other variables.
We first define the alignment vector Z of Figure 8.2. Let Z be a t-size
vector where each entry Zi denotes the first position of the motif in si . Note
that here we do not care about the alignment of the nonmotif or background
region of the input.
The initial estimate Z0 is computed as follows. For each sequence si , a
position from 1 to ni is chosen randomly under a uniform distribution.

8.7.1 Estimating ρ given an alignment


This is a straightforward estimation defined as follows. For convenience, we
use the alternate notation for sij as

s[i, j].

For a fixed column c (1 ≤ c ≤ l), define a t-sized vector of characters as

    V_c[i] = s[i, Z_i + c − 1],

where 1 ≤ i ≤ t. Then

    ρ_{σc} = \frac{\text{Number of times } σ \text{ appears in array } V_c}{t}.

Pseudocounts & ρ corrections. However, a value of zero in an entry of


the ρ matrix is not desirable since this value never changes over the iterations
(see Exercise 87). We make the corrections as follows

    ρ′_{σc} = \frac{b_σ + \text{Number of times } σ \text{ appears in array } V_c}{B + t},

where b_σ is a σ-dependent pseudocount and B is the sum of the pseudocounts.
This also implies that

    \sum_{σ∈Σ} ρ′_{σc} = 1,

for each c.
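A sketch of this estimation for a given alignment vector Z (0-based start positions and a uniform pseudocount per character, both choices made here); under the Gibbs strategy described next, one randomly chosen sequence would be left out of the counts.

def estimate_profile(seqs, Z, l, alphabet="ACGT", pseudo=1.0):
    """Estimate the motif profile from an alignment vector Z (Section 8.7.1).

    Z[i] is the (0-based) start position of the motif in seqs[i]; a pseudocount
    `pseudo` is added per character so that no entry of the profile is zero."""
    t = len(seqs)
    B = pseudo * len(alphabet)
    rho = {sigma: [0.0] * l for sigma in alphabet}
    for c in range(l):
        counts = {sigma: pseudo for sigma in alphabet}
        for i in range(t):
            counts[seqs[i][Z[i] + c]] += 1
        for sigma in alphabet:
            rho[sigma][c] = counts[sigma] / (B + t)
    return rho

seqs = ["AACCTA", "ATGTAGG", "ATACTA"]
Z = [1, 0, 2]                      # occurrences ACC, ATG, ACT (0-based starts)
rho = estimate_profile(seqs, Z, 3)
for sigma in "ACGT":
    print(sigma, [round(v, 3) for v in rho[sigma]])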

How do pseudocounts fit into the framework? In Bayesian analysis prior


probabilities are used (which are usually subjective) for the values of the
estimated parameters. A common choice for such priors is the Dirichlet dis-
tribution which amounts to an addition of a pseudocount to the actual counts
as shown above. They are to be interpreted as a priori expectations of the
character occurrences (even possibly different in different positions, i.e., val-
ues of c, though we have treated them as the same for all values of c above)
in the input.

The Gibbs sampling algorithm. We presented above a very general


method of estimating ρ from the alignment. However, under the Gibbs sam-
pling strategy, a sequence, sy , is picked at random and removed from the
collection and the motif probabilities are computed exactly as above with the
remaining t − 1 sequences.
This step is also called the ‘predictive update step’.

8.7.2 Estimating background probabilities given Z


The background or the nonmotif region is computed in the same manner as
ρ is. Note that each sequence si has an l-wide motif region and the remainder
is background. Thus each position j in the sequence si is in the

    motif region if Z_i ≤ j ≤ Z_i + l − 1,
    background otherwise.

The background probabilities for the input are estimated for each σ ∈ Σ as

    ρ′_{σ0} = \frac{b_σ + \text{Number of times } σ \text{ appears in the background}}{B + \text{Size of the background}}.
The background probability can also be computed for each sequence si by
considering the background of only si .

8.7.3 Estimating Z given ρ


This step is also called the ‘sampling step’. For each sequence si and each
position
1 ≤ j ≤ ni − l + 1
in this sequence, we compute the pattern probability Qj and the background
probability Pj and take the ratio
    A_j = \frac{Q_j}{P_j}.

Q_j is computed as described in Section 8.6.2. In sequence si, the position j
is chosen with probability

    \frac{A_j}{\sum_j A_j},

and Z_i is set to this picked value.
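A sketch of the sampling step for a single sequence; the background is taken here as a fixed per-character distribution, and the numeric profile values are illustrative only.

import random

def sample_position(seq, rho, background, l, seed=0):
    """One Gibbs 'sampling step' for one sequence (Section 8.7.3): score each
    start position by A_j = Q_j / P_j, the ratio of the profile probability to
    the background probability of the same l-window, then draw a position with
    probability proportional to A_j."""
    rng = random.Random(seed)
    weights = []
    for j in range(len(seq) - l + 1):
        q_j = 1.0
        p_j = 1.0
        for c in range(l):
            sigma = seq[j + c]
            q_j *= rho[sigma][c]          # pattern probability of the window
            p_j *= background[sigma]      # background probability of the window
        weights.append(q_j / p_j)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

rho = {"A": [0.55, 0.15, 0.15], "C": [0.15, 0.55, 0.15],
       "G": [0.15, 0.15, 0.55], "T": [0.15, 0.15, 0.15]}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
print(sample_position("ATGTAGG", rho, background, 3))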

8.8 Interpreting the Motif Profile in Terms of p


A motif is usually identified by its description p, its occurrence and the
number of times, K′, it occurs in the input. In our case we have K′ = t.
However, consider a general case when

    K′ ≠ t.

Recall that the task is to discover or recover motif profiles, for some fixed
motif length l, given t sequences. What qualifies as a motif profile?

1. The motif has at least K, called quorum, occurrences in the input. In


other words
K ′ ≥ K.

2. An approximation of the probabilities ρij is given by:


    ρ_{ij} ≈ \frac{µ_{ij}}{K′},

where µ_{ij} is the number of occurrences where σ_i occurs in the jth position.
Notice that by this,

    ρ_{ij} ≤ 1.0, \quad \text{and} \quad \sum_{i=1}^{r} ρ_{ij} = 1.0,

for each j.
Next, a single character must dominate significantly in a position, say
j, to specify a ‘solid’ character in the motif. One way of defining this is
as follows: Given some fixed

0<δ<1

there must be some 1 ≤ i′ ≤ r such that

    ρ_{i′j} − ρ_{ij} > δ, \quad \text{for all } i ≠ i′.

Then the motif at position j takes the value σ_{i′}. If this does not hold,
then that position is defined to be a dont-care.

3. How is p defined? For example, consider the following profile.


 
        ( 0.6  0.3  0.05  0.7  0.1 )  A
    ρ = ( 0.2  0.1  0.05  0.1  0.6 )  C
        ( 0.1  0.5  0.10  0.1  0.1 )  G
        ( 0.1  0.1  0.80  0.1  0.2 )  T

This is interpreted as a motif of length 5 defined as:

    A G T A C.

Notice that the definition of the occurrence of a motif is intricately associ-


ated with the definition of the motif profile.
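The reading-off of p from ρ with the dominance threshold δ can be sketched as follows (the function name and the use of ‘.’ for the dont-care symbol are choices made here):

def profile_to_pattern(rho, delta=0.3):
    """A position becomes the 'solid' character sigma if sigma's probability
    exceeds every other character's probability by more than delta; otherwise
    the position becomes a dont-care '.'."""
    alphabet = list(rho)
    l = len(rho[alphabet[0]])
    pattern = []
    for c in range(l):
        best = max(alphabet, key=lambda s: rho[s][c])
        if all(rho[best][c] - rho[s][c] > delta for s in alphabet if s != best):
            pattern.append(best)
        else:
            pattern.append(".")
    return "".join(pattern)

rho = {"A": [0.6, 0.3, 0.05, 0.7, 0.1], "C": [0.2, 0.1, 0.05, 0.1, 0.6],
       "G": [0.1, 0.5, 0.10, 0.1, 0.1], "T": [0.1, 0.1, 0.80, 0.1, 0.2]}
print(profile_to_pattern(rho, delta=0.1))   # AGTAC
print(profile_to_pattern(rho, delta=0.4))   # positions below the margin become '.'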

8.9 Exercises
Exercise 85 (Statistical measures) What is the relationship between mea-
sure F2 and the information content I, for an input set of sequences and a
motif profile ρ of length l, shown below:
    F_2 = \sum_{σ∈Σ} \sum_{c=1}^{l} f_{σc} \log \left( \frac{ρ_{σc}}{ρ_{σ0}} \right),

    I = \sum_{σ∈Σ} \sum_{c=1}^{l} ρ_{σc} \log \left( \frac{ρ_{σc}}{f_σ} \right).

Recall that column 0 of ρ stores the background probabilities of the characters;


fσc , c ≥ 1, is the number of times σ appears in position c of the motif oc-
currence in the input sequences; fσ is the number of times σ appears in the
input.

Exercise 86 (Estimating ρ) See Section 8.5 for definition of the terms used
here.
Given z, consider the following scheme for estimating ρ using sequence si .
For each 1 ≤ c ≤ l, define
    ρ_{σc} = \sum_{s_i[j+c−1] = σ} z_{i(j+c−1)}.

1. Show that for each c,

    \sum_{σ∈Σ} ρ_{σc} = 1.

2. Is it possible to have an entry of zero in the profile matrix? Why?

3. What is the inadequacy of this scheme?

Hint: 1. Use Equation (8.4). 3. Consider the following.

s1 = A C G A A C G G A A
z1j = 0.05 0.2 0.1 0.1 0.05 0.2 0.05 0.1 0.1 0.05

Let the subscript r of ρ denote the character σr . By this scheme,

ρA1 = 0.05 + 0.1 + 0.05 + 0.1 + 0.05 = 0.35


ρC1 = 0.2 + 0.2 = 0.4
ρG1 = 0.1 + 0.05 + 0.1 = 0.25

What is ρA2 , ρA3 ?

Exercise 87 (Initial estimate of ρ)

1. For some r and some c, let

ρrc = 0.0.

Then argue that at all subsequent iterations in the algorithm (both Meth-
ods 1 and 2) ρrc is likely to remain 0.0.

2. For some r and some c, let

ρrc = 1.0.

Then argue that at all subsequent iterations in the algorithm (both Meth-
ods 1 and 2) ρrc is likely to remain 1.0.
In other words the motif is likely to have σr at position c in all iterations.

3. Let the initial estimate ρ0 in the Expectation Maximization approach be


as follows.
         ( 0.0  0.0  0.0 )  A
    ρ0 = ( 1.0  0.0  0.0 )  C
         ( 0.0  0.0  0.0 )  G
         ( 0.0  1.0  1.0 )  T

Then argue that at all iterations q ρ(q) is likely to be ρ0 . In other words,


at each iteration the motif is likely to be

C T T.

Hint: 1. See the update procedures. 2. Note that all the other entries in
that column must be zero. 3. From 1 & 2.

Exercise 88 (Proof of convergence) In the iterative procedure (both Meth-


ods 1 and 2), can you argue that the solution improves over the iterations? In
other words,

    F(ρ^{(q)}) > F(ρ^{(q′)})

is likely to hold for a measure F and iterations q > q′.
Let F be defined to be F1 , F2 or information content I of Section 8.4.
Hint: How is ρ or the alignment Z estimated? How likely is it that the F
value decreases over an iteration?

Exercise 89 (Σ size) Discuss the effect of alphabet size |Σ| on the learning
algorithm.
Hint: The number of nucleic acids is 4 and the number of amino acids is 20.
So can a system that discovers transcription factors in DNA sequences also
discover protein domain motifs? Why? What parameters need to change?
What about a binary sequence?

Exercise 90 (Multiple occurrences) Discuss the main issues involved in


allowing multiple occurrences of a motif in a sequence in the input data.
Hint: For l = 3, see occurrences below.
s1 = A A C C T A

s2 = A T G T A G G

s3 = A T A C T A
This gives two possible alignments:
Alignment 1 Alignment 2
s1 AACCT A s1 AACCT A
s2 AT GT AGG s2 AT GT AGG
s3 AT AC T A s3 AT AC T A
motif? ACG motif? ACG
If s3 also had 2 occurrences, how many alignments could there be? In the
worst case, how many alignments are possible? How is the multiplicity of this
kind incorporated in the probability computations?

Exercise 91 (Generalizations)
1. Discuss how Expectation Maximization (Method 1) presented in this
chapter can be generalized to handle multiple occurrences in a single
sequence of the input.
2. Discuss how the Gibbs Sampling approach (Method 2) can be generalized
to handle multiple occurrences in a single sequence of the input.
3. Can the methods be extended to incorporate unsupervised learning? Why?
Hint: 1. How should z be updated? 2. What should Z0 be ? How is
Z0 updated? 3. One of the major difficulties is in guessing in which of the
sequences the motif is absent while estimating ρ or Z.

Exercise 92 (Substitution, scoring, Dayhoff, PAM, BLOSUM ma-


trices) A substitution matrix or a scoring matrix M is such that, value M [i, j]
is used to score the alignment of residue i with residue j in an alignment of
two protein sequences. Two such matrices are discussed below.
1. Margaret Dayhoff and colleagues developed the PAM (Percent Accepted
Mutation) series of matrices. In fact Margaret pioneered protein com-
parisons and data basing and developed the model of protein evolution
encapsulated in a substitution matrix. This matrix is also called the
Dayhoff matrix.
(a) The values are derived from global alignments of closely related
sequences.
(b) Matrices for greater evolutionary distances can be extrapolated from
those of smaller ones.
(c) The number with the matrix, such as PAM60, refers to the evolu-
tionary distance; the larger the number, the greater the distance.
2. Steve Henikoff and colleagues developed the BLOSUM (BLOcks SUbsti-
tution Matrix) series of matrices.
(a) The values are derived from local (ungapped) alignments of dis-
tantly related protein sequences.
(b) All matrices are directly calculated, no extrapolation is used.
(c) The number with the matrix, such as BLOSUM60, refers to the
minimum percent identity of the blocks used to construct the matrix;
the larger the number, the smaller the distance.
What is the relationship between substitution matrix M and the probability
matrix ρ of Equation (8.1), if any? Is M stochastic? Is M symmetric?
Why?
Chapter 9
The Subtle Motif

While there is algorithms,


there is hope.
- anonymous

9.1 Introduction: Consensus Motif


As we saw in Chapter 8 the problem of detecting common patterns across
biopolymers such as DNA sequences for locating interesting segments such
as regulatory sites, transcription binding factors or even drug target binding
sites, is indeed a challenging problem.
In this chapter we discuss a combinatorial modeling of the same problem:
the signal is viewed as a consensus segment of the different sequences. The
inadequacy, in a sense, of the learning methods and a rationale for the use of
combinatorial model by the community, is discussed in the next section.
However, even in the combinatorial framework, not surprisingly the problem
continues to be challenging. The main difficulty is that these motifs have
subtle variations at each occurrence and also the location of the variation
may differ at each occurrence. Nevertheless there is enough commonality to
qualify this segment as a motif. This commonality is often known as a subtle
motif.
One of the advantages of using this model over a learning model is that
a combinatorial model is more suitable for dealing with the signal occurring
over different length segments in the input (due to say insertion or deletion
of bases). See an example in Figure 9.1 taken from [CP07] of a transcription
binding factor. The different values in the pos column show that the alignment
moves the sequences around.1 Only the segments that contain the signal are
shown in the figure. Observe how the signal differs at each occurrence. The
‘-’ in the alignment corresponds to a gap.

1 See the cited paper for any further details on this example.


No pos Predictions M I
0 −101 T G A C G T C A 1
1 −299 T G C − G T C A 1
2 −71 T G A C A T C A 1 1
3 −69 A T G A − G T C A G 2
4 −527 T G C G A T G A 2 1
6 −173 T G A − C T A A 2
7 −1595 T G A − A T G A 2
8 −221 T G G − G T C T 2
9 −69 T G A − C T G C 3
10 −105 T G A − A T C A 1
12 −780 T G C − G T C A 1
14 −1654 A T G A − A T C A 1 1
15 −69 A T G A − G T C A A 2
16 −97 T G A − G T A A 1
17 −1936 A T G A − A T C A 1 1
signal TGA GTCA
FIGURE 9.1: An example of a subtle motif (signal) as a transcription
binding factor in human DNA. Notice that at each occurrence the motif is
some ‘edit distance’ away from the consensus signal. The edit operations are
mutation (M) and insertion (I): see Section 9.3 for details on edit distance.

9.2 Combinatorial Model: Subtle Motif


The subtle motif problem has been of interest to both biologists and com-
puter scientists. A satisfactory practical solution has been elusive although
the problem is defined very precisely:

Problem 9 (The consensus motif problem): Given t sequences si on


an alphabet Σ, a length l > 0 and a distance d ≥ 0, the task is to find all
patterns p, of length l that occur in each si such that each occurrence p′i on si
has at most d mismatches with p.
The problem in this form made its first appearance in 1984 [WAG84]. In this
discussion, the alphabet Σ is

{A, C, G, T }

and the problem is made difficult by the fact that each occurrence of the
pattern p may differ in some d positions and the occurrence of the consensus
pattern p may not have
d=0
in any of the sequences.

In the seminal paper [WAG84], Waterman and coauthors provide exact


solutions to this problem by enumerating neighborhood patterns, i.e., patterns
that are at most d Hamming distance from a candidate pattern. Sagot gives a
good summary of the (computational) efforts in [Sag98] and offers a solution
that improves the time complexity of the earlier algorithms by the use of
generalized suffix trees. These clever enumeration schemes, though exact,
have a drawback that they run in time exponential in the pattern length.
Some simple enumeration schemes are discussed in Section 9.6.
How do the methods discussed in Chapter 8 fit in? As we have seen this
problem of detecting common subtle patterns across sequences is of great in-
terest. Various statistical and machine learning approaches, which are inexact
but more efficient, have been proposed in literature [LR90, LAB+ 93, BE94,
HS99]. One of the questions that can be asked to compare and test the efficacy
of such motif discovery systems is:

Given a set of sequences that harbor (with mutations) k motifs, what


percentage of the k motifs does the system recover?

When k is large, the learning methods of Chapter 8 recover a large percentage


of the k embedded motifs. Yet another question to ask is:

Given a set of sequences that harbor (with mutations) ONE motif p,


does the system recover p?

This is a rather difficult criterion to meet since the learning algorithms use
some form of local search based on Gibbs sampling or expectation maximiza-
tion. Hence it is not surprising that these methods may miss p.
However, a question of this form is a biological reality. Consider the fol-
lowing, somewhat contrived, variation of Problem 9 which is an attempt at
simplifying the computational problem.

Problem 10 (The planted (l, d)-motif problem): Given t sequences
si on Σ: a pattern p of length l is embedded in s′i , with exactly d errors
(mutations), to obtain the sequence si of length n, for each 1 ≤ i ≤ t.
The task is to recover p, given si , 1 ≤ i ≤ t, and the two numbers l and d.

Pevzner and Sze [PS00] made the question more precise and provided a bench-
mark for the methods, by fixing the following parameters:

1. n = 600 (length of each input sequence is 600),

2. t = 20 (the number of sequences is 20),

3. l = 15 (the length of the signal or motif is 15) and

4. d = 4 (the number of mutations is exactly 4 in each embedding).


A solution to this apparently simplified problem was so difficult that it was
dubbed the challenge problem.
This chapter discusses methods that solve problems of this flavor. This
formalization, in a sense, is the combinatorial version of the problem discussed
in Chapter 8. A further generalization, along with a method to tackle it, is
presented in the concluding section of the chapter.
We first clarify the different ‘motifs’ used in this chapter. The central goal
is to detect the consensus or the embedded or the planted motif in the given
data sets which is also sometimes referred to as the signal in the data or the
subtle signal. When a motif is not qualified with these terms, it refers to a
substring that appears in multiple sequences, with possible wild cards.

9.3 Distance between Motifs


We begin by going through some very basic definitions of distance between
motifs. Let Σ be the alphabet on which the input string s, as well as any
pattern p is defined.

Hamming distance. Given two patterns (or strings) p1 and p2 of length l


each, the Hamming distance is defined as the number of positions 1 ≤ j ≤ l
such that
p1 [j] 6= p2 [j].
For example, consider

    p1 = A C A T G
    p2 = A T A T A
           X     X

The Hamming distance between p1 and p2 is given as

    Hamming(p1 , p2 ) = 2,

since

    p1 [1] = p2 [1], p1 [3] = p2 [3], and p1 [4] = p2 [4],

but

    p1 [2] ≠ p2 [2] and p1 [5] ≠ p2 [5],

marked as ‘X’ in the alignment.
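The definition translates directly into a few lines of code. The following Python sketch (the function name hamming is ours, for illustration only) computes the Hamming distance of two equal-length strings:

    def hamming(p1: str, p2: str) -> int:
        # number of positions at which the two equal-length strings differ
        assert len(p1) == len(p2)
        return sum(a != b for a, b in zip(p1, p2))

    print(hamming("ACATG", "ATATA"))   # 2, as in the example above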

Edit distance. Given a pattern (or string) p1 various edit operations can
be performed on p1 to produce p2 . Here we describe three edit operations
that are effective on a position j on p1 as follows.2 The edit distance between


p1 and p2 is written as:
eDis(p1 , p2 ).
1. Mutation (M): p1 [j] is changed to some

σ (≠ p1 [j]) ∈ Σ.

For example if p1 = A C C then mutation at j = 2 can give any one of


the following as p2

p2 = A G C or A T C or A A C.

In this case, we write


eDis(p1 , p2 ) = 1.

2. Deletion (X): p1 [j] is deleted. For example if p1 = A C C then deletion


at j = 2 gives p2 as
p2 = A C.
In this case, we write
eDis(p1 , p2 ) = 1.

3. Insertion (I): A σ ∈ Σ is inserted at position j. For example if p1 =


A C C then insertion at j = 2 can give any one of the following as p2

p2 = A A C C or A C C C or A G C C or A T C C.

In this case, we write


eDis(p1 , p2 ) = 1.

Note that a deletion or insertion results in a different length of the pattern,


i.e.,
    |p1 | ≠ |p2 |.
To summarize, we define the distance between two substrings (or motifs)
as either Hamming or an edit distance. Note that the latter is more general,
since the Hamming distance usually assumes that p1 and p2 are of the same
length. This is written as
dis(p1 , p2 ),
and the context will define whether this distance is Hamming or an edit dis-
tance.
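The text does not prescribe an algorithm for eDis; a standard dynamic-programming sketch with unit-cost mutation, insertion and deletion (one reasonable way to compute it) is:

    def edit_distance(p1: str, p2: str) -> int:
        # unit-cost edit distance (mutation, insertion, deletion)
        n, m = len(p1), len(p2)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            D[i][0] = i                      # delete all of p1[:i]
        for j in range(m + 1):
            D[0][j] = j                      # insert all of p2[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i][j] = min(
                    D[i - 1][j] + 1,                             # deletion
                    D[i][j - 1] + 1,                             # insertion
                    D[i - 1][j - 1] + (p1[i - 1] != p2[j - 1]),  # mutation or match
                )
        return D[n][m]

    print(edit_distance("ACC", "AC"))   # 1 (a single deletion), as above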

2 Itis possible to have edit operations defined on a segment of the string instead of a single
location j. For example inversion on position 2-4 can transform
p1 = G A C T C to p2 = G T C A C.
Let the occurrence of a motif in the input be o. For a given length l and d
(< l), a motif p is a subtle motif if at each occurrence, o, in the input

dis(p, o) ≤ d.

For example let l = 5 and d = 2 and the three occurrences in the sequences
are shown below.

    s1 = [T A T C C] T
    s2 = [A C T C A] C
    s3 = C [T C C A A]

    subtle motif p = T C T C A

Notice that in sequence 1, the motif differs at positions 2 and 5; in sequence


2 at position 1; and at positions 3 and 4 in sequence 3.
Let the occurrences in the three sequences (shown bracketed above) be termed
o1 , o2 and o3 . At each occurrence the motif differs in at most d (=2) positions.
This is also written as:

    dis(p, o1 ) = 2 ≤ d,
    dis(p, o2 ) = 1 ≤ d,
    dis(p, o3 ) = 2 ≤ d.

Thus one must look at all the occurrences to infer what the motif must be.

9.4 Statistics of Subtle Motifs


A subtle motif is defined by its length parameter l and the edit distance
d with which it occurs in the given t sequences. For different values of these
parameters, does it leave some detectable clues behind? We explore this by
studying the statistics of subtle motifs in a very general setting.
We recall some basic definitions here. Given t sequences of length l each, a
pattern satisfies quorum K if it occurs in

K′ ≥ K

of the given t sequences. Further it is of maximal size h, if in each of the K ′


occurrences, the size cannot be increased without decreasing the number of
occurrences K ′ .
For simplicity, we assume that the sequences are of the same length l, that all
t sequences are aligned, and that a pattern occurs at most once in
each sequence.
Given a motif, let the embedded signal in each sequence be constructed
with some d edit operations. Given one of these edit operations, we assume

1. mutation (M), with probability of mutation given as qM ,

2. deletion (X), with probability of deletion given as qX and

3. insertion (I), with probability of insertion given as qI .

Since the only permissible edit operations are these three,

qM + qX + qI = 1.

The model. We consider the following simplified model. Given a fixed pattern
(or signal),
psignal ,

of length l, we construct t sequences from psignal . To construct each sequence,


d positions in the pattern psignal are picked at random and an edit operation
(mutation with probability qM , deletion with probability qX and insertion
with probability qI ) is applied to produce the sequence. Then we study these
t sequences.
In other words, given t sequences we assume that they are aligned. For
example, the table below on the left shows exactly one edit applied to the
signal motif and the table on the right shows the alignment of these embedded
motifs.

    Edits    signal = ACGTAC    Alignment
    M        ACGTCC             A C G − T c C
    X        AGTAC              A − G − T A C
    I        ACGATAC            A C G a T A C
    M        ACCTAC             A C c − T A C
    M        GCGTAC             g C G − T A C

Assume that d out of the l positions are picked at random on the embedded
motif for exactly one of the edit operations, insertion, deletion or mutation.
l can be viewed as the size of the motif. Recall that we assume that the
sequences are correctly aligned. Then if a position in the aligned sequence is
a mismatch, then either it is due to

1. a mutation (whose probability is qM ) or

2. an insertion (whose probability is qI ).


Then the probability of this position to be a dot character (mismatch) is

    (qM + qI ) d/l.

Next, the probability q of a position to be a solid character in a motif is:

    q = 1 − (qM + qI ) d/l.    (9.1)
q for three scenarios is shown below.

                                                   qM     qX     qI     q
    1) Exactly d mutations                         1      0      0      1 − d/l
    2) Exactly d edits                             1/3    1/3    1/3    1 − 2d/(3l)
    3) Exactly d edits with equiprobable
       indel and mutation                          1/2    1/4    1/4    1 − 3d/(4l)
When no more than d′ edit operations are carried out on the embedded
motif, it is usually interpreted as each collection of

0, 1, 2, . . . , d′

positions being picked with equal probability, and thus

    d = d′/2

for Equation (9.1).

Estimating the probability of occurrence of a motif. Recall that q is


the probability of a position (character) in the input data to match a character
in the pattern (signal). Let H be the number of solid characters and let the
motif appear in at least K sequences.
For instance in the following alignment, for the first 4 rows, i.e., k = 4, the
pattern has H = 3 solid characters, namely A, T and C, shown in bold at the
bottom row. In other words, in these four rows, the solid characters appear
in each row of the aligned sequences.

    Alignment            Pattern
    A C G − T c C        A C G − T c C   ←
    A − G − T A C        A − G − T A C   ←
    A C G a T A C        A C G a T A C   ←
    A C c − T A C        A C c − T A C   ←
    g C G − T A C        g C G − T A C
                         A       T   C   (k)

For a pattern p with some H solid characters, let p occur in some k sequences
(and not in the remaining (t − k) sequences). Then
FIGURE 9.2: For t = 20, l = 20, the expected number of maximal motifs
E[ZK,q ], is plotted against (a) quorum K shown along the X-axis (for different
values of q which are close to 1.0), and, (b) against q shown along the X-axis
(for different values of quorum K).

1. the probability of matches in the (aligned) H solid characters in the k
rows is

    q^(Hk),

and

2. the probability of at least one mismatch in the (aligned) H positions in
the remaining (t − k) sequences, is

    (1 − q^H)^(t−k).

Thus the probability of occurrence of pattern p is given by

    (1 − q^H)^(t−k) q^(Hk).    (9.2)

FIGURE 9.3: For t = 20, l = 20, the expected number of maximal motifs
E[ZK,q ], is plotted against quorum K shown along the X-axis, for different
values of q, in a logarithmic scale. Unlike the plot in Figure 9.2, the value of q
here varies from 0.25 to 1.0. Notice that when q = 1, the curve is a horizontal
line at y = 1. Note that for DNA sequences, q = 0.25 corresponds to the
random input case.

If Ep denotes the event that p occurs in some fixed k sequences then for any
two distinct events, i.e., p1 ≠ p2 , Ep1 and Ep2 are not necessarily mutually
exclusive. However, if the pattern is maximal, i.e., H is the maximum number
of solid characters seen in the k sequences, then for a fixed set of k sequences,
there is at most one maximal pattern that occurs in these k sequences and not
in the remaining t − k sequences. Further, when the pattern is maximal there
is a guarantee of mismatch in the remaining (l − H) positions in all the k rows
and the probability of this mismatch is given as

    (1 − q^k)^(l−H).    (9.3)

Thus to summarize, the probability of occurrence of some pattern with exactly
H solid characters in exactly k sequences is given by

    (1 − q^H)^(t−k) q^(Hk) (1 − q^k)^(l−H).    (9.4)

Thus if Pmaximal (K, H, q) is the probability that some maximal pattern with
H solid characters and quorum K occurs in the input data, then using
Equation (9.4),

    Pmaximal (K, H, q) = Σ_{k=K..t} (t choose k) (1 − q^H)^(t−k) q^(Hk) (1 − q^k)^(l−H).    (9.5)

Let ZK,q be a random variable denoting the number of maximal motifs with
quorum K and q as defined above, and let E[ZK,q ] denote the expectation of
ZK,q . Using linearity of expectations (for a fixed t and l),

    E[ZK,q ] = Σ_{h=1..l} (l choose h) Pmaximal (K, h, q)
             = Σ_{h=1..l} (l choose h) [ Σ_{k=K..t} (t choose k) (1 − q^h)^(t−k) q^(hk) (1 − q^k)^(l−h) ].

Now, it is rather straightforward to estimate E[ZK,q ] given different values of q


corresponding to different scenarios. Figures 9.2 and 9.3 show some examples.
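The two expressions above are straightforward to evaluate numerically. A small Python sketch (the names p_maximal and expected_maximal_motifs are ours) that computes Equation (9.5) and E[ZK,q ] for any choice of t, l, K and q:

    from math import comb

    def p_maximal(K, H, q, t, l):
        # Equation (9.5): probability that some maximal pattern with H solid
        # characters and quorum K occurs
        return sum(comb(t, k) * (1 - q**H) ** (t - k) * q ** (H * k) * (1 - q**k) ** (l - H)
                   for k in range(K, t + 1))

    def expected_maximal_motifs(K, q, t, l):
        # E[Z_{K,q}] by linearity of expectation
        return sum(comb(l, h) * p_maximal(K, h, q, t, l) for h in range(1, l + 1))

    # the setting of Figures 9.2 and 9.3: t = l = 20
    for q in (0.25, 0.75, 0.95):
        print(q, [expected_maximal_motifs(K, q, 20, 20) for K in (2, 10, 16, 20)])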

9.5 Performance Score


Next, we define measures to evaluate the predictions of the subtle signal.
We describe two simple measures that are commonly used [TLB+ 05, CP07].
Note that unlike in the motif learning process, these measures are not designed
to drive the algorithm but to give a quantitative measure of how good the
predictions are postmortem. This is used on data (possibly benchmark) where
the correct solution is known either by other means or simply through the
knowledge of the data construction process.
1. Let P be the set of all positions covered by the prediction and S be the
same set for the embedded motif. The score of the prediction P , with
respect to the embedded motif, can be given as:
        score = |P ∩ S| / |P ∪ S|.

    The score is 1 if the prediction is 100% correct. However, even for scores
    much smaller than one, the embedded motif may have been computed correctly;
    in this sense the measure is rather stringent.
2. The solution coverage (SC) is defined as the number of sequences that
   contain at least one occurrence of the predicted motif whose distance
   from the prediction is within the problem constraint, i.e., bounded by d.
   Again, if the coverage is equal to the total number of sequences t, then
   the prediction can be considered 100% correct.
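Both measures translate into a few lines of Python. In the sketch below the function names are ours, and dis stands for any distance of Section 9.3 (for instance the hamming or edit_distance sketches given earlier):

    def score(predicted_positions, embedded_positions):
        # measure 1: intersection over union of the covered position sets
        P, S = set(predicted_positions), set(embedded_positions)
        return len(P & S) / len(P | S)

    def solution_coverage(sequences, prediction, d, dis):
        # measure 2: number of sequences with at least one occurrence of the
        # predicted motif within distance d of the prediction
        l = len(prediction)
        return sum(any(dis(prediction, s[j:j + l]) <= d
                       for j in range(len(s) - l + 1))
                   for s in sequences)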

9.6 Enumeration Schemes


We first study some enumeration schemes to detect the subtle signal. Given
t input sequences and a motif length l with distance d, the task is to detect
the subtle signal in the input.
We next discuss the estimation of the set of potential signals Csignal . It is
tempting to use an exact algorithm when the alphabet size is not too large.

9.6.1 Neighbor enumeration (exact)


An obvious method is to generate a set of potential signals, Csignal , which
is the set of all possible l-mers. It is easy to see that, in this case,

|Csignal | = |Σ|^l .

Next each p ∈ Csignal is checked against the input sequences.


Note that in this naive scheme, Csignal is independent of the input se-
quences. An easy improvement over such a blatant enumeration scheme is to
generate a more restricted version of Csignal that depends on the input. Here
we describe such an enumeration scheme.

Step 1 (Computing C1 , C2 , . . . , Ct ). For each input sequence si

Ci = {p | p is a substring of length l in si }.

This can be obtained by a single scan of si from left to right, and at each
location j, extracting a pattern p as

p = s[j . . . (j + l − 1)].

and then removing the duplicates. An auxiliary information, j, the location


of p on the sequence si , is associated with p.
Next, it is easy to see that, for each sequence si ,

|Ci | ≤ |si | − l + 1. (9.6)


Step 2 (Computing C′1 , C′2 , . . . , C′t ). For each p ∈ Ci construct the ‘neigh-
borhood’ patterns as follows:

Nebor(p, d) = {p′ | Hamming(p, p′ ) = d}. (9.7)

Next for each Ci , construct Ci′ :

Ci′ = {p′ | p′ ∈ N ebor(p, d) where p ∈ Ci }.

The auxiliary information associated with each p′ is

{p1 , p2 , . . . , pr } ⊂ Ci ,

which is the set of r patterns such that p′ is at distance d from each of these
patterns.
What is the size of each Ci′ ? The number of positions that are mutated in
the pattern p is d, thus the number of distinct choices of these d positions is
no more than

    (l choose d).

Further, if the original value at one of the positions is σ, then it can take any
value from the set Σ \ {σ}. Thus the total number of distinct patterns at a
distance d from a pattern is no more than

    (l choose d) (|Σ| − 1)^d.    (9.8)

Using Equations (9.6) and (9.8),

    |Ci′ | ≤ |Ci | (l choose d) (|Σ| − 1)^d
          ≤ (|si | − l + 1) (l choose d) (|Σ| − 1)^d
          = O( |si | (l choose d) |Σ|^d ).

Step 3 (Computing Csignal ). It is easy to see that the possible embedded
patterns (signals) are given by the set Csignal :

    Csignal = C1′ ∩ C2′ ∩ . . . ∩ C′t−1 ∩ Ct′ .

For each
p′′ ∈ Csignal ,
there exists some


p1 ∈ C1 , p2 ∈ C2 , . . . , pt ∈ Ct ,
such that p′′ is at distance d from each of the t patterns. In turn, each of
these t patterns, corresponds to some location(s) ji on each sequence si

j1 , j2 , . . . , jt .

Thus it can be said that signal p′′ is embedded at location j1 in s1 , j2 in s2 ,


. . ., jt in st .
We use the following example to illustrate the method.

Example 2 The input is defined as follows:

s1 = T A T C C, Σ = {A, T, C},
s2 = A C T C A, t = 3,
s3 = C T T T C, l = 4 and d = 1.

Step 1 on the example. The Ci sets are computed as follows:

    C1 = {TATC, ATCC},
    C2 = {ACTC, CTCA},
    C3 = {CTTT, TTTC}.

Step 2 on the example. The Ci′ sets are computed as follows:

    C1′ = Nebor(TATC, 1) ∪ Nebor(ATCC, 1), where
      Nebor(TATC, 1) = {AATC, CATC, TTTC, TCTC, TAAC, TACC, TATA, TATT},
      Nebor(ATCC, 1) = {ATCA, ATCT, ATAC, ATTC, AACC, ACCC, CTCC, TTCC}.

    C2′ = Nebor(ACTC, 1) ∪ Nebor(CTCA, 1), where
      Nebor(ACTC, 1) = {ACTA, ACTT, ACAC, ACCC, AATC, ATTC, TCTC, CCTC},
      Nebor(CTCA, 1) = {CTCT, CTCC, CTAA, CTTA, CACA, CCCA, ATCA, TTCA}.

    C3′ = Nebor(CTTT, 1) ∪ Nebor(TTTC, 1), where
      Nebor(CTTT, 1) = {CTTA, CTTC, CTAT, CTCT, CATT, CCTT, ATTT, TTTT},
      Nebor(TTTC, 1) = {TTTA, TTTT, TTAC, TTCC, TATC, TCTC, ATTC, CTTC}.

Step 3 on the example. The set of potential signals is estimated as follows:

    Csignal = C1′ ∩ C2′ ∩ C3′ = {TCTC, ATTC}.

Since

    TCTC ∈ Nebor(TATC, 1) ∩ Nebor(ACTC, 1) ∩ Nebor(TTTC, 1), and
    ATTC ∈ Nebor(ATCC, 1) ∩ Nebor(ACTC, 1) ∩ Nebor(TTTC, 1),
this gives two alignments of the input si ’s as shown below.

    Using TCTC:                    Using ATTC:
    s1 = [T A T C] C               s1 = T [A T C C]
    s2 = [A C T C] A               s2 = [A C T C] A
    s3 = C [T T T C]               s3 = C [T T T C]
    consensus:  T C T C            consensus:  A T T C

This shows that there are two embedded signals,

T C T C and A T T C,

that satisfy the l = 4, d = 1 constraints.
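The three steps can also be sketched compactly in Python. The illustration below (function names are ours) uses neighborhoods of Hamming distance at most d, which matches Problem 9 and, on Example 2, recovers the same two signals:

    from itertools import combinations, product

    def neighbors(p, d, alphabet):
        # all strings at Hamming distance at most d from p
        out = {p}
        for positions in combinations(range(len(p)), d):
            for subs in product(alphabet, repeat=d):
                q = list(p)
                for j, s in zip(positions, subs):
                    q[j] = s
                out.add("".join(q))
        return out

    def consensus_candidates(seqs, l, d, alphabet):
        sets = []
        for s in seqs:
            C_i = {s[j:j + l] for j in range(len(s) - l + 1)}                    # Step 1
            sets.append(set().union(*(neighbors(p, d, alphabet) for p in C_i)))  # Step 2
        return set.intersection(*sets)                                           # Step 3

    print(consensus_candidates(["TATCC", "ACTCA", "CTTTC"], 4, 1, "ATC"))
    # {'TCTC', 'ATTC'} (in some order)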


This concludes the discussion on exact neighbor enumeration. See [RBH05]
for more details on this approach and some practical implementations of the
exact approach.

9.6.2 Submotif enumeration (inexact)


The central observation used in this approach is as follows. A submotif
of the embedded signal (motif ) occurs in the input more often than a randomly
selected submotif. The method is inexact since some solutions can be missed,
as we will see in the concrete example below.
The method works in the following steps. Given l, fix some k (< l).

1. Randomly pick k out of l positions to create a mask.

2. A mask is used to pick up k-mers from the input sequences.


3. Certain masks occur in multiple sequences, suggesting a local multiple
   alignment. By the observation that a random submotif is not likely to
   occur too many times, this can be used to extract the signal (subtle
   motif).

Consider a concrete example where l = 4. We discuss cases k = 2 and k = 3


for a sample input below.

Case k = 2. Step 1. If the k positions picked at random are 2 and 4, then


the mask takes a value 1 in positions 2, 4 and ‘dont-care’ in the remaining
positions, encoded as
.1.1

The exhaustive list of masks for parameters l = 4, k = 2 is shown below:

mask1 =1 1 . .
mask2 =1 . 1 .
mask3 =1 . . 1
mask4 = . 11 .
mask5 = . 1 . 1
mask6 = . . 1 1

Step 2. A mask is used to pick up k-mers from the input sequences. An
l-length mask can be used on an n-length sequence

    n − l + 1

times. For example, the 2-mers picked for mask1 and mask2 are shown below.

    mask1 = 1 1 . .  on s1 = TATCC:   TATC → TA,   ATCC → AT
    mask2 = 1 . 1 .  on s1 = TATCC:   TATC → TT,   ATCC → AC

The complete list of 2-mers picked by the masks is given below.

              s1 = TATCC          s2 = ACTCA          s3 = CTTTC
    mask1     TA..; AT..          AC..; CT..          CT..; TT..
    mask2     T.T.; A.C.          A.T.; C.C.          C.T.; T.T.
    mask3     T..C; A..C          A..C; C..A          C..T; T..C
    mask4     .AT.; .TC.          .CT.; .TC.          .TT.; .TT.
    mask5     .A.C; .T.C          .C.C; .T.A          .T.T; .T.C
    mask6     ..TC; ..CC          ..TC; ..CA          ..TT; ..TC

Step 3. The local alignments suggested by some of the masks are shown below.

           masked 2-mer    support (s1 + s2 + s3)    suggested alignment
    (I)    T.T.            1 + 0 + 1                 s1 = T A T C C
           T..C            1 + 0 + 1                 s3 = C T T T C
    (II)   .T.C            1 + 0 + 1                 s1 = T A T C C
                                                     s3 = C T T T C
    (III)  .TC.            1 + 1 + 0                 s1 = T A T C C
                                                     s2 = A C T C A
    (IV)   ..TC            1 + 1 + 1                 s1 = T A T C C
                                                     s2 = A C T C A
                                                     s3 = C T T T C
    (V)    C.C.            0 + 2 + 0                 −

Consensus alignment of the three sequences gives the signal as shown below.

Using alignments (I), (III), (IV):

    s1 (v11 ) = [T A T C] C
    s2 (v21 ) = [A C T C] A
    s3 (v32 ) = C [T T T C]
    consensus:   T C T C

Case k = 3. Next, consider the same example with k = 3.


Step 1. The exhaustive list of masks for parameters l = 4, k = 3 is shown
below:
mask1 = 1 1 1 .
mask2 = 1 1 . 1
mask3 = 1 . 1 1
mask4 = . 1 1 1

Step 2.

              s1 = TATCC          s2 = ACTCA          s3 = CTTTC
    mask1     TAT.; ATC.          ACT.; CTC.          CTT.; TTT.
    mask2     TA.C; AT.C          AC.C; CT.A          CT.T; TT.C
    mask3     T.TC; A.CC          A.TC; C.CA          C.TT; T.TC
    mask4     .ATC; .TCC          .CTC; .TCA          .TTT; .TTC

Step 3.

           masked 3-mer    support (s1 + s2 + s3)    suggested alignment
           T.TC            1 + 0 + 1                 s1 = T A T C C
                                                     s3 = C T T T C

Notice that this does not extract the l length signal. It just extracts the
following

    T.TC
This example illustrates the fact that this enumeration is inexact, since one
of the solutions is missed in Case 2. It is also quite possible that a solution is
missed in all possible values of k.
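A minimal Python sketch of the mask-based enumeration (names are ours; mask positions are 0-indexed here) that reproduces the bookkeeping of Steps 1-3:

    from itertools import combinations
    from collections import defaultdict

    def submotif_support(seqs, l, k):
        # for every mask (choice of k of the l positions) and every masked
        # k-mer, record which (sequence, offset) pairs support it
        support = defaultdict(set)
        for mask in combinations(range(l), k):
            for i, s in enumerate(seqs):
                for j in range(len(s) - l + 1):
                    window = s[j:j + l]
                    key = (mask, "".join(window[q] for q in mask))
                    support[key].add((i, j))
        return support

    seqs = ["TATCC", "ACTCA", "CTTTC"]
    support = submotif_support(seqs, l=4, k=2)
    for (mask, kmer), occ in support.items():
        if len({i for i, _ in occ}) == len(seqs):   # masked k-mer seen in every sequence
            print(mask, kmer, sorted(occ))          # (2, 3) 'TC' -- the ..TC case (IV) above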

9.7 A Combinatorial Algorithm


The enumeration scheme, though exact, is very compute-intensive. For
most real data, it is prohibitively time consuming to use such an exact method.
Here we describe a combinatorial approach [PS00] (Winnower) that reduces
the given problem to finding cliques in a t-partite graph.
A graph G(V, E) is t-partite, if the vertex set can be partitioned as

V = V1 ∪ V2 ∪ . . . ∪ Vt ,

where for i ≠ j, 1 ≤ i, j ≤ t,

    Vi ∩ Vj = ∅,

and for each pair vi1 , vi2 ∈ Vi , vi1 vi2 ∉ E holds.
In other words, the vertex set can be partitioned into t (nonintersecting) sets
such that the edges go across the sets but not within each set.
Given a graph G(V, E), a subgraph

G(V ′ ⊂ V, E ′ ⊂ E),

is a clique if for

each pair v1 , v2 ∈ V ′ , v1 v2 ∈ E ′ holds.

In other words, a clique is a subgraph where every two vertices is connected


by an edge.
It is important to note that the problem of finding a t-sized clique in a
t-partite graph is NP-complete. Then why reduce the given problem to yet
another difficult problem? The clique finding problem is well studied and
various heuristics have been designed to effectively solve the problem and we
wish to exploit these insights to solve our problem at hand. However, to avoid
digression we do not discuss the details of solving the clique problem here.

Back to Problem 1. We begin with the following observation:


If
d = Hamming(p, p1 )
= Hamming(p, p2 ),
then
2d ≥ Hamming(p1 , p2 ).
The proof is straightforward and is left as an exercise for the reader. How is
this fact used to solve the problem?
In this approach, we take an l-mer at some position j on si and associate this
with an l-mer on some position j ′ on si′ , i′ ≠ i, for each i′ when it is plausible
that the two l-mers are at a distance d from some hypothetical signal p. Since
we do not know what p is, we use the above observation to associate the two
l-mers only when the distance between them is no more than 2d. Next, we
seek a collection of t l-mers where every pair is associated with each other and
each l-mer is from a distinct input sequence. In other words, if each l-mer is
a vertex in a graph and the pair-wise association is an edge, we seek a clique
of size t.
Formally put, a t-partite graph G(V, E) is generated in the following two
steps:
1. Each distinct l-mer in si , given by
si [j . . . (j + l − 1)],
is mapped to a vertex vij . Each partition Vi is defined as
Vi = {vij | for some j},
and
V = V1 ∪ V2 ∪ . . . ∪ Vt .
Also note that
Vi1 ∩ Vi2 = ∅,
for i1 6= i2 .
2. For vi1 j1 ∈ Vi1 and vi2 j2 ∈ Vi2 ,
vi1 j1 vi2 j2 ∈ E
if and only if
Hamming(si1 [j1 . . . (j1 + l − 1)], si2 [j2 . . . (j2 + l − 1)]) ≤ 2d.
FIGURE 9.4: The tripartite graph with each partition (of vertices) arranged
along a column; the top row of vertices is TATC, ACTC, TTTC and the bottom
row is ATCC, CTCA, CTTT, with edges labeled by Hamming distances. The two
cliques are the top row and bottom row respectively of vertices.

It is easy to see that G(V, E) is a t-partite graph. Next the task is to find all
cliques of size t in the graph.
Each such clique gives an alignment of the input sequences and a consensus
motif p. This p is checked to see if the problem constraints are satisfied.
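A small Python sketch of this construction (names are ours; the clique search here is a brute-force product over the partitions, exponential in t, and is meant only to illustrate the reduction, not the heuristics used in practice):

    from itertools import combinations, product

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def winnower_cliques(seqs, l, d):
        # vertices: distinct l-mers per sequence (one partition per sequence);
        # edges: cross-partition pairs at Hamming distance at most 2d;
        # report every choice of one vertex per partition that is a clique
        parts = [sorted({s[j:j + l] for j in range(len(s) - l + 1)}) for s in seqs]
        def is_clique(choice):
            return all(hamming(a, b) <= 2 * d for a, b in combinations(choice, 2))
        return [c for c in product(*parts) if is_clique(c)]

    for clq in winnower_cliques(["TATCC", "ACTCA", "CTTTC"], l=4, d=1):
        print(clq)   # the two cliques Clq1 and Clq2 of Example (2) below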

Example (2). The 3-partite graph is constructed as follows (see Figure 9.4).
The vertex set is
V = V1 ∪ V2 ∪ V3 ,
where each Vi is defined as follows.

1. V1 = {v11 , v12 },
where T AT C is mapped to v11 and AT CC is mapped to v12 .

2. V2 = {v21 , v22 },
where ACT C is mapped to v21 and CT CA is mapped to v22 .

3. V3 = {v31 , v32 },
where CT T T is mapped to v31 and T T T C is mapped to v32 .

The following upper diagonal matrix shows the Hamming distance between
the two l-mers mapped to each pair of vertices. Each nonzero distance gives an
edge in the graph; a 0 entry marks a pair whose distance exceeds 2d = 2 (no
edge) and X marks a pair of vertices within the same partition (no edge by
construction).

                     v12      v21      v22      v31      v32
                     (ATCC)   (ACTC)   (CTCA)   (CTTT)   (TTTC)
    v11 (TATC)        X        2        0        0        1
    v12 (ATCC)                 2        2        0        2
    v21 (ACTC)                          X        0        2
    v22 (CTCA)                                   2        0
    v31 (CTTT)                                            X
    v32 (TTTC)

The two cliques in this graph are:

1. Clq1 = {v11 , v21 , v32 } and

2. Clq2 = {v12 , v21 , v32 }.

The two alignments corresponding to the two cliques are:

    Using Clq1:                          Using Clq2:
    s1 (v11 ) = [T A T C] C              s1 (v12 ) = T [A T C C]
    s2 (v21 ) = [A C T C] A              s2 (v21 ) = [A C T C] A
    s3 (v32 ) = C [T T T C]              s3 (v32 ) = C [T T T C]
    consensus:   T C T C                 consensus:   A T T C

Thus there are two embedded signals,

T C T C and A T T C,

that satisfy the l = 4, d = 1 constraints.


This concludes the discussion on the combinatorial approach of this section.
Other possible approaches can be based on enumerating possible patterns
and checking their candidacy for being the subtle pattern using clever heuris-
tics and an exhaustive search in a reduced space. Similar approaches, with
different heuristics are presented in [PRP03, KP02a, EP02].

9.8 A Probabilistic Algorithm


What is the issue with the combinatorial approach? Recall that the clique
computation is not easy. Most of the times this approach fails on nontrivial
data since the heuristics are unable to extract the clique. The trouble is with
enumerating (scanning) the solution space.
So we pursue a method that randomly samples this solution space. However
to be effective, we carefully orchestrate the sampling as described below. The
method is named Projections by its creators/designers Buhler and Tompa.
We first define a condensed submotif. Given a motif p, a submotif of p is a
sequence with one or more matching characters with p. If k characters match
with p then a k-mer is obtained by simply removing the nonmatching spaces.
For example, if
p = A C G C C T,
Then some condensed submotifs of p are:

    submotif                       k    condensed submotif (k-mer)
    p0 = A C G C C T  (= p)        6    p′0 = ACGCCT
    p1 = A C . C C T               5    p′1 = ACCCT
    p2 = A C G . C T               5    p′2 = ACGCT
    p3 = A . G C . T               4    p′3 = AGCT
    p4 = . C G C . .               3    p′4 = CGC
    p5 = . C G . C .               3    p′5 = CGC
    p6 = . C . . . .               1    p′6 = C
    p7 = . . . C . .               1    p′7 = C

Note that distinct condensed submotifs could give rise to the same k-mers.
For example,

    p4 ≠ p5 , but p′4 = p′5 , and
    p6 ≠ p7 , but p′6 = p′7 .

This may affect the method, but we ignore this.


Most probabilistic algorithms perform a number of independent trials of
a basic iterant. In this case each iterant is exactly along the lines the Sub-
motif enumeration scheme for some k (Section 9.6.2) picked randomly. Also,
condensed submotifs are used instead of submotifs.
Recall that Step 3 of the enumeration scheme tracks the occurrence of the
l-mer (k-mer for the condensed submotif). In practice, a hash table is used to
store the details of the occurrences. Roughly speaking, a hash table is indexed
by an attribute (or a key) and usually takes only

O(1)

time to access the entry in the table. Thus the tables shown in Step 3 of the
enumeration scheme of Section 9.6.2 can be efficiently constructed, or filled
in. Note that in this case a condensed submotif is a key to the hash table,
which results in considerable reduction in the size of the hash table. Also, at
each iterant a distinct value of k is used and the same hash table is fortified
with more entries. Thus each iteration strengthens the table.
At the end of this process, each entry in the table that shows significant
support is picked up for further scrutiny and the hidden signal is extracted
from the local alignment suggested by the support. In fact this step uses
the learning algorithms discussed in Chapter 8. The reader is directed to the
paper by Buhler and Tompa [BT02] for further details of this algorithm.
9.9 A Modular Solution


We conclude the chapter by discussing a method that combines the solution
from two well-studied problems

1. unsupervised (combinatorial) pattern discovery and

2. sequence alignment.

By delegating the task to these subproblems, the method can also handle
deletions and insertions (called indel) in the embedded signal.

Problem 11 (The indel consensus motif problem): Given t sequences
si on an alphabet Σ, a length l > 0 and a distance d ≥ 0, the task is to find
all patterns p of length l that occur in each si such that each occurrence p′i
on si is at an edit distance (mutation, insertion, deletion) of at most d from p.

This approach uses unsupervised motif discovery to solve Problem 10 and


also works well for the more general Problem 9.3
Recall that the signal (‘subtle motifs’) is embedded in t random sequences.
The problem is compounded by the fact that although the consensus motif
is solid (i.e., an l-mer without wild cards or dont-care characters), it is not
necessarily contained in any of the t sequences. However, given an alignment,
the consensus motif satisfying the (l, d) constraint may be extracted with ease.
In other words, one of the difficulties of the problem is that the sequences
are unaligned. The extent of similarity across the sequences is so little that
any global alignment scheme cannot be employed.
This method employs two steps:

1. First, potential signal (PS) segments of interest are identified in the


input sequences. This is done by using the imprints of the discovered
motifs on the input.

2. Second, amongst these segments, exhaustive comparison and alignment


is undertaken to extract the consensus motif.

This delineation into two steps also helps address the more realistic version
of the problem that includes insertion and deletion in the consensus motif
(Problem 11). The main focus of this method is in obtaining good quality PS
segments and restricting the number of such segments to keep the problem
tractable.

3A similar approach of using pattern discovery on a simpler problem of finding similarities


in protein sequences has been attempted in [FRP+ 99].
The Type I error or false negative errors, in detecting PS segments, are


reduced by using appropriate parameters for the discovery process based on
the statistical analysis of consensus motifs discussed in Section 9.4.
The Type II error or false positive errors are reduced by using irredundant
motifs [AP04] and their statistical significance measures [ACP05] discussed
in Chapter 7. Loosely speaking, irredundancy helps to control the extent of
over-counting of patterns and the pattern-statistics helps filter the true sig-
nal from the signal-like-background. In the scenario where indels (insertions
and/or deletes) are permitted along with mutations, the unsupervised dis-
covery process detects extensible motifs (instead of rigid motifs that have a
fixed imprint length in all the occurrences). Also, the second step uses gapped
alignments.
All nonexact methods are based on profiles or on k-mers, both of which are
rigid. It is reasonable to say that, if the number of indels is much smaller than
the size of the consensus, the chance of recovering the signal by such methods
may be high. However, when the number of indels grow, it is unclear how
these methods would work. Also, it is not immediately apparent how these
methods can accommodate indels, since the rigidity in the profiles or k-mers
is intrinsic to the method. On the other hand, the modular approach of using
extensible motifs is one possible solution to overcome this bottleneck.

Rationale for using combinatorial unsupervised motif discovery. A


motif of length l that occurs across t′ ≤ t sequences provides a local alignment
of length l for the t′ sequences. The best case scenario, for the problem, is
when the embedded motif m is identical in all t sequences and the discov-
ery process detects this single maximal (combinatorial) motif with quorum
t. So the scenarios closer to the best case should have fewer (but important)
maximal motifs. Figure 9.2(a) shows the expected number of motifs with
different values of q and quorum K. Notice that the expected number of
motifs saturates for small values of K and falls dramatically as K increases.
The saturation at lower values occurs since maximal motifs are being sought.
Thus as q increases the saturation occurs at a higher value of K. Figure 9.2(b)
shows the variation of the expected number of maximal motifs with q which is
unimodal, for different values of K. The value of q is determined by the given
problem scenario and thus a large value of K is a good handle on controlling
the number and ‘quality’ of maximal motifs.
The signal is embedded in the background and it is important to exploit
the characteristics that distinguishes one from the other. The background is
assumed to be random. Under this condition, it is easy to see that

    q = 1/4

(see Section 9.4). Thus the need is to compare

    E[ZK,q ]

with

    E[ZK,1/4 ],

the expectation for the random case. See Figure 9.3 for the plots of

log(E[ZK,q ])

against quorum K to compare these expectation curves, particularly around


small values (close to 1 in the Y-axis).
For example, consider the case when

q = 0.75,

this is the approximate value of q for the challenge problem of Section 9.1. In
Figure 9.3, this is shown by the red curve and for large K, say

K ≥ 16,

the expected number of motifs is small. Also, the corresponding expected


numbers for the random case is extremely low, thus providing a strong contrast
in the number of expected motifs. Hence the reasonable choice for the quorum
parameter K is 16 or more, in the unsupervised discovery process.
It must be pointed out that in the case where the embedded motif is changed
with insertions and/or deletions (indels), the q value is computed appropri-
ately using Equation (9.1) and the corresponding expectation curve in Fig-
ure 9.3 must be studied. However, the burden is heavier on the unsupervised
discovery process and an extensible (or, variable-sized gaps) motif discovery
capability can be used.4

9.10 Conclusion
We have discussed several strategies for tackling the problem of finding sub-
tle signals across sequences. This continues to be an active area of research
with very close interaction between biologists, computer scientists and math-
ematicians.

4 Varun
[ACP05] is available at:
www.research.ibm.com/computationalgenomics.
9.11 Exercises
Exercise 93 (Distance) Let Σ = {0, 1}. A pattern p of size l is defined on
Σ.

1. Enumerate all p′ , defined on Σ, at a Hamming distance

(i) exactly d from p and


(ii) at most d from p.

2. Enumerate all p′ , defined on Σ, at an edit distance

(i) exactly d from p and


(ii) at most d from p,

where the edit operations allowed are (a) mutation, (b) insertion and (c)
deletion.

Hint: Design an ‘enumeration tree’ particularly for 1(ii) and 2(ii) to avoid
multiple enumerations of the same patterns.

Exercise 94 (Exact neighbor enumeration)

1. Devise an efficient algorithm to generate the neighbors of a string Ci′ of


Equation (9.7) from Ci .

2. What is the running time complexity of the enumeration scheme of Sec-


tion 9.6?

Hint: Note that duplicates must be removed to compute the sets C1 , C2 , . . . , Ct


and C1′ , C2′ , . . . , Ct′ .

Exercise 95 Show that if

d = Hamming(p, p1 )
= Hamming(p, p2 ),

then
2d ≥ Hamming(p1 , p2 ).

Exercise 96 (Enumerations) Let p = A C G C C T.

1. How many submotifs does p have?


2. How many distinct k-mers (k = 1, 2, . . . , 6) does p have?


Hint: Note that C is repeated in p that can give the same k-mer from distinct
submotifs.

Exercise 97 What is inexact about the enumeration scheme of Section 9.6.2?


Hint: If d is the edit distance, then how many dot characters, d′ , does the
mask have? Is there a formula for d′ ?

Exercise 98 Modify the combinatorial algorithm of Section 9.7 to utilize


the Hamming distance between vertices. The current algorithm simply uses
nonzero distance.
Hint: See also SP-Star [PS00] and patternbranching [PRP03, KP02a] for
effective utilization of these weights and more.

Comments
The topic of this chapter exemplifies the difficulties with biological reality.
Elegant combinatorics and practical statistical principles, along with biologi-
cal wisdom may be sometimes required to answer innocent-looking questions.
Part III

Patterns on Meta-Data
Chapter 10
Permutation Patterns

Out of clutter, find simplicity;


from discord, find harmony.
- attributed to A. Einstein

10.1 Introduction
In this chapter we deal with a different kind of motif or pattern: one that
is defined by merely its composition and not the order in which they appear
in the data. For example, consider two chromosomes in different organisms.
We study gene orders in a section of the chromosomes of two organisms as
shown below:

s1 = . . . g1 g2 g3 g4 g5 g6 g7 . . .

s2 = . . . g8 g5′ g2′ g4′ g3′ g9 g0 . . .

Genes gi (in s1 ) and gi′ (in s2 ) are assumed to be orthologous genes. Clearly,
it is of interest to note that the block of genes g2 , g3 , g4 , g5 appear together,
albeit in a different order in each of the chromosomes. This collection of genes
is often called a gene cluster. The size of the cluster is the number of elements
in it and in this example the size is 4. Such clusters or sets of objects are
termed permutation patterns.1 They are called so because any one of the
patterns can be numbered 1 to L where L is the size of the pattern and every
other occurrence is a permutation of the L integers. For example in s1 , the
pattern can be numbered as
1 2 3 4,

and in s2 the occurrence is a permutation given as

4 1 3 2.

1 This cluster or set is also called a Parikh vector (Section 10.4) or a compomer (Exercise 118)


10.1.1 Notation
Recall from Chapter 2 (Section 2.8.2) that Π(s) denotes the set of all char-
acters occurring in a sequence s. For example, if

s = a b c d a,

then
Π(s) = {a, b, c, d}.
However s may have characters that appear multiple times (also referred to as
the copy number). Then we use a new notation Π′ (s). In this notation, each
character is annotated with the number of times it appears. For example,

s = a b b c c b d a c b,
Π(s) = {a, b, c, d},
Π′ (s) = {a(2), b(4), c(3), d}.

Thus element a has copy number 2, b has copy number 4 and so on. Note that
d appears only once and the copy number annotation is omitted altogether.
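Both notations are easy to compute; a quick Python illustration (the function names pi and pi_prime are ours) using a character counter:

    from collections import Counter

    def pi(s):
        # Pi(s): the set of characters occurring in s
        return set(s)

    def pi_prime(s):
        # Pi'(s): each character annotated with its copy number
        return dict(Counter(s))

    print(pi("abbccbdacb"))        # {'a', 'b', 'c', 'd'} (in some order)
    print(pi_prime("abbccbdacb"))  # {'a': 2, 'b': 4, 'c': 3, 'd': 1}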
Given an input string s on a finite alphabet Σ, a permutation pattern (or
πpattern) is a set p ⊆ Σ. p occurs at location i on s if

p = Π (s [i, i + 1, . . . , i + L-1]) ,

where L = |p|. p has multiplicity, if we are interested in the multiple occur-


rences of a σ ∈ p. Although strictly speaking, p is not a set anymore, to avoid
clutter we do not distinguish the two forms of p. For example, consider

    s = [a a c b b b] x x [a b c b a b].

The two occurrences of a pattern are shown bracketed. We represent these


occurrences as o1 and o2 and for convenience write them as

o1 = a a c b b b,
o2 = a b c b a b.

If we are interested in multiplicity, or in counting copy numbers, then

p = {a(2), b(3), c}.

Otherwise,
p = {a, b, c}.
The size of the pattern p is written as |p|. In the first case |p| = 6 and in the
second case |p| = 3. Note that in both cases, the length at each occurrence of
the pattern p must be the same.

Further, p satisfies a quorum K, if it occurs at some K ′ ≥ K distinct


locations on s given as
Lp = {i1 , i2 , . . . , iK ′ } .
Lp is the location list of p. In the running example, assuming a quorum K = 2,
permutation pattern p occurs at locations 1 and 9 on s, written as
Lp = {1, 9}.

10.2 How Many Permutation Patterns?


It is important to know the total number of permutation patterns that may
occur on a string s for at least two reasons. Firstly, if it is unduly large, one
needs to explore the possibility of reducing this number without compromising
the pattern definition. Secondly, the number is useful in the statistical analysis
of permutation patterns.

p1 = {a, b} Lp1 = {1, 6, 11}


p2 = {b, c} Lp2 = {2, 7, 12}
p3 = {c, d} Lp3 = {3, 8, 13}
p4 = {d, e} Lp4 = {4, 9, 14}
p5 = {a, b, c} Lp5 = {1, 6, 11}
p6 = {b, c, d} Lp6 = {2, 7, 12}
p7 = {c, d, e} Lp7 = {3, 8, 13}
p8 = {a, b, c, d} Lp8 = {1, 6, 11}
p9 = {b, c, d, e} Lp9 = {2, 7, 12}
p10 = {a, b, c, d, e} Lp10 = {1, 6, 11}
FIGURE 10.1: The exhaustive list of permutation patterns p, with |p| >
1, occurring on s = a b c d e a b c d e a b c d e satisfying quorum K = 3.

Let P be the collection of all permutation patterns on a given input string


s of length n. What is the size of P ? In other words, what is the maximum
number of elements in P ?
Assume that a permutation pattern can start at an arbitrary position i
on s and end at another position j > i on s. Thus the number of permutation
patterns is
O(n2 ).
In other words, we estimate an upper bound of n2 on the total number of
patterns, in the worst case. But is this number actually attained? Consider

the following example:

s = a b c d e a b c d e a b c d e,

and quorum K = 3. The collection of permutation patterns, p with |p| > 1,


on s is listed in Figure 10.1. This construction shows that the number of such
patterns is
    m(m + 1)/2,
where
    m = n/3 − 1.
This construction shows that such a number can indeed be attained.
We next explore the possibility of refining the definition of the pattern to
reduce their number.

10.3 Maximality
In an attempt to reduce the number of permutation patterns in an input
string s, without any loss of information, we use the following definition of a
maximal pattern [LPW05].
Let P be the set of all permutation patterns on a given input string s.
(p1 ∈ P ) is nonmaximal with respect to (p2 ∈ P ) if both of the following hold.
(1) Each occurrence of p1 on s is covered by an occurrence of p2 on s. In
other words, each occurrence of p1 is a substring in an occurrence of p2 .
(2) Each occurrence of p2 on s covers l ≥ 1, occurrence(s) of p1 on s.
A pattern (p2 ∈ P ) is maximal, if there exists no (p1 ∈ P ) such that p2 is
nonmaximal w.r.t. p1 .
It is straightforward to verify the following and we leave the proof as an
exercise for the reader (Exercise 99). Note that this directly follows from the
framework presented in Chapter 4.

LEMMA 10.1
(Maximal lemma) If p2 is nonmaximal with respect to p1 , then p1 ⊂ p2 .

When p2 is nonmaximal with respect to p1 does the following hold

|Lp1 | = |Lp2 | ?

The sizes of the location lists must be the same when each element of the
pattern p1 and p2 has a copy number 1. See Exercise 100 for the possible

relationship between |Lp1 | and |Lp2 | when copy number of some elements
> 1.
However, to show that maximality as defined here is valid, it is important
to show the uniqueness of the set of maximal permutation patterns. Again,
this also follows from the framework presented in Chapter 4.

THEOREM 10.1
(Unique maximal theorem) Let M be the set of all maximal permutation
patterns, i.e.,
M = {p ∈ P | there is no (p′ ∈ P ) such that p is nonmaximal w.r.t. p′ }

Then M is unique.

PROOF We prove this by contradiction. Assume that M is not unique, i.e.


there exist at least two distinct maximal collections M1 and M2 (M1 ≠ M2 )
satisfying the definition. Without loss of generality, let
p ∈ M1 and p ∉ M2 .
Since p ∉ M2 , there must exist p′ ∈ M2 such that p is nonmaximal with
respect to p′ . In other words,
p′ ∉ M1 and p′ ∈ M2 ,
and p is nonmaximal with respect to p′ . This contradicts the assumption
that M1 is a maximal collection. Hence the assumption must be wrong and
M1 = M2 .

It is easy to see that a nonmaximal pattern p1 can be ‘deduced’ from p2


and the occurrences of p1 on s can be estimated to be within the occurrences
of p2 .

10.3.1 P=1 : Linear notation & PQ trees


Given an input s and a quorum K, let
P=1
be the set of all permutation patterns on s where each element in any p ∈ P=1
has a copy number 1. In other words each occurrence of the pattern contains
exactly one instance of each character in p.
Recall that in case of substring patterns, the maximal pattern very obvi-
ously indicates the nonmaximal patterns as well. For example a maximal
pattern of the form a b c d implicates
a b, b c, c d, a b c, and b c d

as possible nonmaximal patterns, unless they have occurrences not covered


by a b c d.
Do maximal permutation patterns (in P=1 ) have such a simple form? For
p ∈ P=1 , let

M (p) = {p′ ∈ P=1 | p′ is nonmaximal w.r.t p }.

Can p have a representation that captures M (p) (without explicitly enumer-


ation)?
Note that the K occurrences on the input

o1 , o2 , . . . , oK

of p are simply different permutations of the elements of p. So the question


is:
Is there is a representation that captures the ‘commonality’ (or M (p))
in all of these K occurrences?
To answer this question, we study the solution to a classic problem in combi-
natorics, the general consecutive arrangement problem.

Problem 12 (The general consecutive arrangement (GCA) problem)


Given a finite set Σ and a collection I of subsets of Σ, does there exist a
permutation s of Σ in which the members of each subset I ∈ I appear as a
consecutive substring of s?
Mapping the GCA problem to our setting:
Σ is the elements of p,
I is M (p), and
we know that o1 , o2 , . . . , oK are
consecutive (linear) arrangements of the elements of p.

Thus the data structure (called a PQ Tree) used in the solution to the GCA
problem can be used as the representation to capture M (p). See Chapter 13
for an exposition on this.
Consider a pattern p and its collection of nonmaximal patterns M (p) given
in Figure 10.2. The PQ tree representation of the maximal pattern is shown
in Figure 10.2. The root node represents the maximal permutation pattern
given as set (10.1), the Q node represents the nonmaximal patterns given as
set (10.2) and the internal P node represents the nonmaximal pattern given
as set (10.3).
Using the symbol ‘-’ to denote immediate neighbors, since the PQ tree is a
hierarchy, it can also be written linearly as:

((a, b, c, d)-(e-f -g)).

This is the maximal notation of the pattern {a, b, c, d, e, f, g} (set (10.1)).


    p = {a, b, c, d, e, f, g},                        (10.1)

    M (p) = {{e, f }, {f, g}, {e, f, g},              (10.2)
             {a, b, c, d}}.                           (10.3)

FIGURE 10.2: The PQ tree notation of the maximal pattern p (leaves
a, b, c, d, e, f, g).

10.3.2 P>1 : Linear notation?


A PQ tree captures the internal structure (in terms of its nonmaximal
permutation components) of a pattern and is an excellent visual representation
of a maximal pattern. However, even an elegant structure such as this has its
limitations: we describe such a scenario where we must use multiple PQ trees
to denote a single maximal pattern.
Given an input s and a quorum K, let P>1 be the set of all permutation
patterns on s where there is some p ∈ P>1 , which has at least one element
that has a copy number > 1.
Let p ∈ P>1 be as follows:

p = {a, b, c(2), d, e, x}. (10.4)

Also, let p have exactly three occurrences on s given as

o1 = d e a b c x c,
o2 = c d e a b x c,
o3 = c x c b a e d.

Assume that none of the elements of p appear elsewhere in the input. What
are the nonmaximal patterns?
Recall that the leaves of a PQ tree are labeled bijectively by the elements
of p. Since p has at least one element σ with copy number c > 1, then the
tree must have c leaves labeled by σ. Assuming we can abuse a PQ structure
thus, can a PQ tree represent all the nonmaximal patterns?
Can we simply rename the two c’s as c1 and c2 ? We can fix this in o1 , but
which c is c1 and which one is c2 in o2 and in o3 ? We must take all possible
renaming into account.

We take a systematic approach and rename the elements of o1 as integers


1, 2, . . . , 7 and using this same scheme, we rename o2 and o3 . Since c has a
copy number of 2, c is renamed as two integers 5 and 7. So a c in o2 and o3
is renamed either as 5 or as 7 and is written as [57]. Then
o1 = d e a b c x c = 1 2 3 4 5 6 7,
o2 = c d e a b x c = [57] 1 2 3 4 6 [57],
o3 = c x c b a e d = [57] 6 [57] 4 3 2 1.
The two renamed choices for o2 are
1. o2 = 5123467, hence o3 = 5674321 or o3 = 7654321, and
2. o2 = 7123465, hence o3 = 5674321 or o3 = 7654321.
Thus the four possible scenarios are:

            o1                o2                o3
    (1)   [1234] [5] [67]   [5] [1234] [67]   [5] [67] [1234]
    (2)   [1234] [5] [67]   [5] [1234] [67]   [76] [5] [1234]
    (3)   [1234] [56] [7]   [7] [1234] [65]   [56] [7] [1234]
    (4)   [1234] [56] [7]   [7] [1234] [65]   [7] [65] [1234]

The nonmaximal patterns are shown as nested groupings (brackets). The following trees
capture the nonmaximal patterns: T1,3 represents the first and third cases,
T2 and T4 represent the second and fourth cases respectively.

[Trees T1,3 , T2 and T4 : PQ trees on the leaves {d, e, a, b, x, c, c} capturing
the nonmaximal patterns of the corresponding cases.]

Note that a nonmaximal pattern may be represented in more than one PQ


tree. For this example, is it possible to construct a single PQ tree that captures
all the non-maximal patterns? Given any maximal p ∈ P>1 , in the worst
case how many PQ trees can represent all the nonmaximal patterns? See
Exercise 102.

10.4 Parikh Mapping-based Algorithm


We next explore the problem of automatically discovering all the permuta-
tion patterns in one or more strings. We formalize the problem as follows.

Problem 13 (The permutation pattern discovery problem) Given a


string s of length n, over a finite alphabet Σ, and a quorum K < n, find all
permutation patterns p (and the location list Lp ) that occur at least K times.

A permutation pattern of size L cannot necessarily be built from a pattern


of size
L′ < L,
as in the case of substring patterns. Hence this problem is considered harder
than the substring pattern discovery problem and to date there is no clever
way of exhaustively discovering these patterns other than finding patterns of
size
2, 3, . . . , L∗
where L∗ is the size of the largest pattern in s.

Overview of the method. We present a rather straightforward discovery


process: in this algorithm a window of size L is scanned over the input,
observing the characters of the scanned window. This tracks the new as well
as previously seen patterns.
If the size of the alphabet is small, i.e.,

|Σ| = O(1),

then in O(1) time each new or old pattern can be accounted for (using an
appropriate hash function), giving an overall O(n) time algorithm, for a fixed
L. However, this assumption may not be realistic and in general

|Σ| = O(n).

Then the approach needs some more care for efficiency and the discussion
here using Parikh Mapping is adapted from an algorithm given by Amir et
al [AALS03]. The reader is directed to [Did03, SS04, ELP03] for a discussion
on other approaches to this problem.
However, the discovered patterns in the algorithm are not in the maximal
notation. We postpone this discussion to Section 10.5 where the Intervals
Problem is presented. We take all the occurrences,

o1 , o2 , . . . ok

of a pattern p, obtained by the algorithm of this section and solve an instance


of the Intervals Problem using the k occurrences as input. In Section 10.5
these occurrences are denoted as

s1 , s2 , . . . sk ,

and the output is the maximal notation of p in terms of a PQ tree. Note that
if p has multiplicities, then the Intervals Problem is invoked multiple times:
see Section 10.3.2 for a detailed discussion on this.

Computing Ψ. Parikh mapping is an important concept in the theory of


formal languages [Par66].2 Given an alphabet of size m,

    Σ = {σ1 < σ2 < . . . < σm },

let wσi be the number of occurrences of σi in w ∈ Σ∗ . Parikh mapping is a
morphism

    Ψ : Σ∗ → N^m ,

where N denotes the nonnegative integers and

    Ψ(w) = (wσ1 , wσ2 , . . . , wσm ).

In this section we discuss an efficient algorithm based on this mapping. The


algorithm maintains an array

Ψ[1 . . . |Σ|],

where Ψ[q] keeps count of the number of appearances of letter q in the current
window. Hence, the sum of the values of the elements of Ψ is L. In each
iteration the window shifts one letter to the right, and at most 2 variables of
Ψ are changed:

(1) one variable is increased by one (adding the rightmost letter) and

(2) one variable is decreased by one (deleting the leftmost letter of the
previous window).

Note that for a given window of size L on s, sa sa+1 . . . sa+L−1 , Ψ represents

Π′ (sa sa+1 . . . sa+L−1 ).
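The window maintenance is easy to express in code. The sketch below (names are ours) keeps Ψ in a counter and, as a simplification of the tagging scheme of Section 10.4.1, uses the sorted nonzero entries of Ψ directly as the tag; window positions are 0-indexed:

    from collections import Counter, defaultdict

    def permutation_patterns(s, L, K):
        # slide a window of size L over s, maintain Psi incrementally and
        # group window positions that share the same Parikh vector
        psi = Counter(s[:L])
        locations = defaultdict(list)
        locations[tuple(sorted(psi.items()))].append(0)
        for i in range(1, len(s) - L + 1):
            psi[s[i - 1]] -= 1                  # leftmost letter leaves the window
            if psi[s[i - 1]] == 0:
                del psi[s[i - 1]]
            psi[s[i + L - 1]] += 1              # rightmost letter enters the window
            locations[tuple(sorted(psi.items()))].append(i)
        return {tag: locs for tag, locs in locations.items() if len(locs) >= K}

    print(permutation_patterns("bbacfbcbade", L=4, K=2))
    # {(('a',1),('b',2),('c',1)): [0, 5], (('a',1),('b',1),('c',1),('f',1)): [1, 2]}
    # i.e., the patterns {a, b(2), c} and {a, b, c, f} of Figure 10.3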

2 More than 40 years after the appearance of this seminal paper, during an informal con-
versation, Rohit Parikh, a logician at heart, told me that he had done this work on formal
languages for money! He explained that as a graduate student he was compelled to take a
summer job that produced this work.

10.4.1 Tagging technique


It is easy to see that substrings of s, of length L, that are permutations of
the same string are represented by the same Ψ. Each distinct Ψ is assigned
a unique tag - an integer in the range 0 to 2n. The tags are given by using
the naming technique [AIL+ 88, KLP96], which is a modified version of the
algorithm of Karp, Miller and Rosenberg [KMR72].
Assume, for the sake of simplicity, that |Σ| is a power of 2. If |Σ| is not a
power of 2, Ψ can be extended to an appropriate size. The size of the resulting
array is no more than twice the size of the original array.
A tag is completely determined by a pair of previous tags. At level j, the
tag of subarray Ψ1 Ψ2 of size 2j is assigned, where Ψ1 and Ψ2 are consecutive
subarrays of size 2j−1 each. The tags are natural numbers in increasing order.
The process may be viewed as constructing a complete binary tree, which is
the binary tagging tree. Notice that every level only uses the tags of the
level below it, thus the tags are nonnegative integers that can be bounded as
discussed below.
We next illustrate this elegant algorithm using a simple example. Consider
the following example with K = 2 and L = 4:

Σ = {a < b < c < d < e < f },


s = b b a c f b c b a d e.

Note that
|Σ| = 6
and the Parikh Mapping array Ψ is padded with the • character so as to make
it a power of 2. This complete example is described in Figure 10.3.

10.4.2 Time complexity analysis


We begin by bounding t, the number of distinct tags generated for a given
input s. This bound not only estimates the space required by the algorithm
but also helps in estimating the time required to search for the existence of
each pair of tags, thus giving a more accurate bound on the running time.
Although it is tempting to compute this number as a function of the window
size L, using the alphabet size |Σ| gives a better bound as shown below.

LEMMA 10.2
The maximum number of distinct tags generated by the algorithm’s tagging
scheme, using a window of size L on a text of length n is

O(|Σ| + n log |Σ|),

where |Σ| is the size of the alphabet.


[Figure 10.3, panels (1)-(8): the binary tagging tree of Ψ at each of the eight
positions of the sliding window over s = b b a c f b c b a d e.]
FIGURE 10.3: The algorithm run on s = b b a c f b c b a d e with L = 4,


K = 2. (1)-(8) shows the sliding of the window on the input and the change
in the binary name tree at each stage. The tags shown in bold are the ones
that change at each iteration from the previous ones. The run shows that
there are two permutation patterns of size 4 on s: (1) p1 = {a, b, c, f }, tagged
11, with Lp1 = {2, 3}, (2) p2 = {a, b(2), c}, tagged 6, with Lp2 = {1, 6}.

PROOF Consider the very first window of size L at position j = 1 on


the string with the corresponding Parikh Mapping array Ψ1 . The number of
distinct tags in the binary tagging tree of Ψ1 is

O(|Σ|).

The height of this tree is


O(log |Σ|).
The total number of iterations is

n − L + 1.

At each iteration, at most log |Σ| changes are made due to addition of a new
character to the right and at most log |Σ| changes are made due to the removal
of an old character to the left. Thus at each iteration j no more than

2 log |Σ|

new tags are generated in the binary tagging tree of Ψj . Thus the number of
distinct tags is
t = O(|Σ| + n log |Σ|).

To give a subarray at level ≥ 1 a tag, we need only to know if the pair of
tags of the composing subarrays has appeared previously. If it has, then the
subarray gets the tag of this pair. Otherwise, it gets a new tag. Assume that
the first elements of the tag pairs are stored in a balanced tree T1. Further,
for each node v of T1, the second elements of the pairs sharing that first
element are stored in another balanced tree Tv2. Thus it takes

O((log t)2)

time to access a tag pair, where both T1 and Tv2 are binary searched and t is
the number of distinct tags.
To summarize, it takes
O(|Σ|)
time to initialize the binary tagging tree of Parikh Mapping array Ψ. The
number of iterations is
O(n)
and at each iteration
O(log |Σ|)
changes are made, each of which takes

O((log t)2 )

time. Thus the algorithm takes

O(|Σ| + n(log t)2 log |Σ|)

time, for a fixed L. If L∗, the size of the largest pattern on s, is not known,
then this algorithm is iterated O(n) times.

10.5 Intervals
The last section gives an algorithm to discover all permutation patterns in
a given string s. We now take a look at a relatively simple scenario: Given K
sequences where n characters appear exactly once in each sequence and each
sequence is of length n, the task is to discover common permutation patterns
that occur in all the sequences.
In other words, s1 can be viewed as the sequence of integers 1, 2, 3, . . . , n
and each of
s2 , s3 , . . . , sK
is a permutation of n integers. See Figure 10.4 for an illustrative example.
Why is this problem scenario any simpler? For input sequences s2 , s3 , . . . sK ,
the encoding to integers allows us to simply study the integers and deduce if
they are potential permutation patterns or not. For example a sequence of
the form
4 6,
can never contribute to a common permutation pattern of size 2 since, in s1
the two are not immediate neighbors. By the same argument, the subsequence

645

can potentially be a permutation pattern. Thus for this problem scenario, it


suffices to store integer intervals, rather than Parikh-maps as was discussed
in the previous section. In Figure 10.4, the interval 1-2 appears in all the last
three sequences. Thus a common permutation pattern is

{1, 2} or {a, d}.

Thus for a pair of integers 1 ≤ i < j ≤ n, let

I = [i, j] = {s[i], s[i+1], . . . , s[j-1], s[j]}.

Then the following check3 yields I’s potential to be a permutation pattern.

3 Heber and Stoye call the difference between the maximum and minimum values the interval
defect.

a d c b e ⇒ 1 2 3 4 5
c e d a b ⇒ 3 5 2 1 4
d a b e c ⇒ 2 1 4 5 3
b c e a d ⇒ 4 3 5 1 2

FIGURE 10.4: A collection of four strings, each defined on the alphabet


{a, b, c, d, e}. The first string is encoded by consecutive integers from 1 to 5,
giving the mappings a ⇔ 1, d ⇔ 2, and so on. Thus the remaining three
strings are encoded as shown.

For each 1 < k ≤ K and each pair of integers, 1 ≤ i < j ≤ n, if


max (πk,i,j ) − min (πk,i,j ) ≠ (j − i),

then I = [i...j] cannot be a permutation pattern, where


πk,i,j = Π(sk [i...j]).
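As a quick illustration, the following Python sketch (the function names are ours, purely for illustration) applies this max-minus-min test to windows of the encoded sequences of Figure 10.4 and confirms that {1, 2} passes it in every sequence.

```python
def is_interval(s, i, j):
    """Check whether positions i..j (1-indexed, inclusive) of s hold a set of
    consecutive integers, using the max - min test of the text."""
    window = s[i - 1:j]            # s is a Python list, hence the 0-index shift
    return max(window) - min(window) == j - i

# The encoded sequences of Figure 10.4; the first is the identity permutation.
sequences = [[1, 2, 3, 4, 5], [3, 5, 2, 1, 4], [2, 1, 4, 5, 3], [4, 3, 5, 1, 2]]

# {1, 2} is a common permutation pattern: some length-2 window of every
# sequence passes the test and carries exactly the values {1, 2}.
target = {1, 2}
for s in sequences:
    found = any(is_interval(s, i, i + 1) and set(s[i - 1:i + 1]) == target
                for i in range(1, len(s)))
    print(found)
```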

In this section we focus on the following problem.

Problem 14 (The intervals problem) Given a sequence s on integers 1,


2, . . ., n, I = [i, j] is an interval if for some l,
Π(s[i...j]) = {l, l + 1, l + 2, . . . , l + (j-i)}.
The task is to find all such intervals I on s.
Given s = 2 1 4 5 3, the four intervals I0 , I1 , I2 , I3 are:

I0 = [1, 5], covering all of 2 1 4 5 3.

I1 = [1, 2], covering the segment 2 1.

I2 = [3, 4], covering the segment 4 5.

I3 = [3, 5], covering the segment 4 5 3.


Clearly, intervals are an alternative representation for permutation patterns
as shown in the following example.

Example 3 Let s = 3 5 2 4 7 6 8 1. Then


permutation patterns intervals
p0 = {1, 2, 3, 4, 5, 6, 7, 8} I0 = [1, 8]
p1 = {2, 3, 4, 5, 6, 7, 8} I1 = [1, 7]
p2 = {2, 3, 4, 5} I2 = [1, 4]
p3 = {6, 7, 8} I3 = [5, 7]
p4 = {6, 7} I4 = [5, 6]

How many intervals? Consider

s = 1 2 3 4 . . . n.

Clearly, each of

{1, 2}, {2, 3} ... ... ... {n-1, n},


{1, 2, 3}, {2, 3, 4} . . . ... {n-2, n-1, n},
{1, 2, 3, 4}, {2, 3, 4, 5} . . . {n-3, n-2, n-1, n},
...
{1, 2, . . . , n},

is an interval. Thus given s of length n, it is possible to have

O(n2 )

intervals.
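A quick sanity check of this count, using the max-minus-min characterization directly (a small Python sketch, not from the text):

```python
# For the identity permutation every window [i, j] is an interval, so the
# number of intervals is n(n - 1)/2, i.e., O(n^2).
n = 6
s = list(range(1, n + 1))
count = sum(1 for i in range(1, n) for j in range(i + 1, n + 1)
            if max(s[i - 1:j]) - min(s[i - 1:j]) == j - i)
print(count, n * (n - 1) // 2)   # both print 15
```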

10.5.1 The naive algorithm


Given an instance of Problem (14), this algorithm discovers all the intervals
by a single scan of s from right to left.
At each scan i, we move a pointer j from i + 1 up to n and check if I = [i, j]
is an interval. We keep track of the highest and the lowest value of I, and
that is sufficient to check if it is an interval. Thus the interval checking can
be done in O(1) time.
For 1 ≤ i < j ≤ n, recall

[i, j] = Π(s[i . . . j]).

Then we define the following:


1. l(i, j) = min[i, j] and u(i, j) = max[i, j],
2. R(i, j) = u(i, j) − l(i, j) and r(i, j) = j − i,
3. f (i, j) = R(i, j) − r(i, j)
The following is rather a straightforward observation but critical in designing
an efficient algorithm.

LEMMA 10.3
Let s be a sequence of n integers where each number appears exactly once.
Then for all 1 ≤ i < j ≤ n, the following statements hold.
1. f (i, j) ≥ 0. In other words, it cannot take negative values.
2. If f (i, j) = 0, then [i, j] is an interval.

Algorithm 7 The Intervals Extraction


(1) FOR i = n − 1 DOWNTO 1 DO
(2) u ← s[i], l ← s[i]
(3) FOR j = i + 1 TO n DO
(4) IF s[j] > u, u ← s[j]
(5) IF s[j] < l, l ← s[j]
(6) f ← (u − l) − (j − i)
(7) IF (f = 0), OUTPUT [i, j]
(8) ENDFOR
(9) ENDFOR
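A direct Python transcription of Algorithm (7) may be useful for experimentation; it is a sketch (intervals reported as 1-indexed pairs), not code from the text.

```python
def naive_intervals(s):
    """Algorithm (7): report every [i, j] (1-indexed) with f(i, j) = 0."""
    n = len(s)
    out = []
    for i in range(n - 1, 0, -1):          # FOR i = n-1 DOWNTO 1
        u = l = s[i - 1]                   # u <- s[i], l <- s[i]
        for j in range(i + 1, n + 1):      # FOR j = i+1 TO n
            u = max(u, s[j - 1])
            l = min(l, s[j - 1])
            if (u - l) - (j - i) == 0:     # f = (u - l) - (j - i)
                out.append((i, j))
    return out

# The example of Problem 14: s = 2 1 4 5 3 has the four intervals
# [1, 2], [1, 5], [3, 4] and [3, 5].
print(sorted(naive_intervals([2, 1, 4, 5, 3])))
```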

10.5.1.1 Analysis of algorithm (7)


The algorithm has two loops:

1. Lines (1)-(9) is the main loop (that is executed n − 1 times), and

2. Lines (3)-(8) is the inner loop (n − i times).

Lines (2), (4), (5), (6), (7) take O(1) time each. Lines (4), (5), (6), (7) are
executed
1 + 2 + . . . + (n − 2) + (n − 1) = n(n − 1)/2
times. Thus the entire algorithm takes

O(n2 )

time.
Notice that the number of intervals in s could be O(n2 ). Thus an algorithm
that outputs all the intervals must do Ω(n2 ) work in the worst case. But what if s has
only O(n) intervals, can we do better?
An algorithm whose time complexity is a function of the output size is
called an output sensitive algorithm. Let NO be the number of intervals in a
string s of length n = NI . We next describe an output sensitive algorithm
that takes time
O(NO + NI ).

10.5.2 The Uno-Yagiura RC algorithm


One man’s data structure is another man’s algorithm, goes an old computer
science adage. Here we discuss an algorithm that crucially depends on the
linked list data structure that was briefly discussed in Section 2.6: this enables
the algorithm to add elements to a list and access them in a LIFO (Last In
First Out) order that makes the overall algorithm efficient.
This is an output sensitive algorithm which we call the Uno-Yagiura RC
algorithm [UY00]. This follows the basic structure of Algorithm (7), but cuts

down the number of candidates for the checking at lines (6) and (7). Hence,
the authors Uno and Yagiura call this the Reduce Candidate (RC) algorithm.
We give the pseudocode of the algorithm as Algorithm (8).

Algorithm 8 The RC Intervals Extraction


(0-1) CreateList(n − 1, n, s[n], L, NIL)
(0-2) CreateList(n − 1, n, s[n], U, NIL)

(1) FOR i = n − 1 DOWNTO 1
(2)   InsertLList(i, i + 1, s[i], L)
(3)   InsertUList(i, i + 1, s[i], U)
(4)   ScanpList(i, U, L)
(5) ENDFOR

Algorithm 9 The LIFO List Operations


CreateList(i, j, val, ptr, nxt)
  NEW(ptr)
  ptr.i ← i, ptr.j ← j
  ptr.val ← val
  ptr.next ← nxt

InsertUList(i, j, v, Hd)
  t ← Hd
  WHILE (t ≠ NIL) & (v > t.val)
    t ← t.next
  Create(i, j, v, Hd, t)

InsertLList(i, j, v, Hd)
  t ← Hd
  WHILE (t ≠ NIL) & (v < t.val)
    t ← t.next
  Create(i, j, v, Hd, t)

We describe the algorithm and its various aspects in the following five parts.
We conclude with a concrete example.

1. Identification of potent indices.

2. Construction of two sequences of functions:

u(n − 1, j), u(n − 2, j), . . . , u(2, j), u(1, j), and


l(n − 1, j), l(n − 2, j), . . . , l(2, j), l(1, j).

Each function, u(i, j) and l(i, j) is defined over j = i + 1 up to n.

3. p-list of potent indices.

4. Correctness of the RC algorithm.



5. Time complexity of the RC algorithm. Let NO be the number of inter-


vals in s.

(a) The u(·, ·) function list U is processed in O(n) time.


(b) The l(·) function list L is processed in O(n) time.
(c) The list of potent indices p-list is processed in O(NO ) time.

1. Potent indices. We first identify certain j’s (index) called potent.4 For
a fixed i, for some i < jp ≤ n, let

ujp = u(i, jp ) and ljp = l(i, jp ).

jp is potent with respect to (w.r.t.) i if and only if jp is the largest possible j


satisfying
u(i, j) = ujp and l(i, j) = ljp .
Consider the following example where the input s is shown in bold below. Let
i = 2.
j 1 2 3 4 5 6 7
s[j] 2 4 3 7 6 1 5
↑ ↑ ↑
i j1 j2

1. Let j1 (> i) = 5. Is j1 potent w.r.t. i = 2?


To answer this, we compute the following:

u(2, 5) = 7 and l(2, 5) = 3.

Also, j1 is the largest possible value of j with

u(2, j) = 7 and l(2, j) = 3.

Hence j1 = 5 is potent w.r.t. i = 2.

2. Let j2 (> i) = 6. Is j2 potent w.r.t. i = 2?


We again compute the following:

u(2, 6) = 7 and l(2, 6) = 1.

But,
u(2, 7) = 7 and l(2, 7) = 1.
Since j = 7 > j2 , j2 = 6 is not potent w.r.t. i = 2.

4 Uno and Yagiura in their paper use unnecessary j’s, which in a sense is complementary to

the idea of potent j. I define potent j’s for a possibly simpler exposition.

In other words, a j is potent if [i′ , j] is potentially an interval, i.e., f (i′ , j) = 0,


for some i′ ≤ i.
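For concreteness, the small Python routine below (illustrative, not part of the original algorithm) lists the potent indices for a fixed i by brute force, straight from the definition; the RC algorithm never computes them this way, but the routine is handy for checking examples by hand.

```python
def potent_indices(s, i):
    """All j > i (1-indexed) that are potent w.r.t. i, i.e., the largest index
    attaining their particular (u(i, j), l(i, j)) pair."""
    n = len(s)
    pairs = {}
    for j in range(i + 1, n + 1):
        window = s[i - 1:j]
        pairs[j] = (max(window), min(window))
    return [j for j in pairs
            if all(pairs[j2] != pairs[j] for j2 in pairs if j2 > j)]

# The example above: s = 2 4 3 7 6 1 5 and i = 2.
# j = 5 is potent (the largest j with the pair (7, 3)); j = 6 is not,
# since j = 7 shares the pair (7, 1).  Prints [3, 5, 7].
print(potent_indices([2, 4, 3, 7, 6, 1, 5], 2))
```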

LEMMA 10.4
If [i, j] is an interval, then j must be potent w.r.t. i.

PROOF It is easier to prove the contrapositive:


If j > i is not potent w.r.t. i, then [i, j] is not an interval.
Assume the contrary, i.e., [i, j] is an interval. If j is not potent with respect
to i, then there exists a j ′ > j such that

u(i, j) = u(i, j ′ ) and l(i, j) = l(i, j ′ ).

Then clearly
l(i, j) ≤ s[j ′ ] ≤ u(i, j)
which leads to a contradiction: since [i, j] is an interval, every value between l(i, j) and u(i, j) already appears in s[i . . . j], and the elements of s are distinct, so s[j ′ ] cannot take such a value. Hence the assumption must be wrong.

Thus, in conclusion, only the potent j ′ s are sufficient to extract all the
intervals in s. In the algorithm, p-list is the list of potent j’s (in increasing
value of the index).
We begin by studying some key properties of u(i, j), l(i, j) and f (i, j) func-
tions.

LEMMA 10.5
(Monotone functions lemma) Let i ≥ 1 be fixed and for i < j1 < j2 ≤ n,
the following hold.
• (U.1) u(·, ·) is a nondecreasing function of j, i.e., u(i, j1 ) ≤ u(i, j2 ).
• (L.1) l(·, ·) is a nonincreasing function of j, i.e., l(i, j1 ) ≥ l(i, j2 ).

It is straightforward to verify the statements and we leave this as an exercise


for the reader (Exercise 108).

LEMMA 10.6

• (F.1) Let 1 ≤ i < j1 < j2 ≤ n. If

f (i, j1 ) > 0, and


f (i, j1 ) > f (i, j2 ),

then [i′ , j1 ] is not an interval for any 1 ≤ i′ ≤ i.



• (F.2) Let 1 ≤ i1 < i2 < j1 < j2 ≤ n. Further, let the following hold.
u(i1 , j1 ) = u(i2 , j1 ) and l(i1 , j1 ) = l(i2 , j1 ) and
u(i1 , j2 ) = u(i2 , j2 ) and l(i1 , j2 ) = l(i2 , j2 ).
Then
f (i1 , j1 ) − f (i1 , j2 ) = f (i2 , j1 ) − f (i2 , j2 ).

PROOF Proof of statement (F.1): Let


S(i, j) = {k | l(i, j) ≤ k ≤ u(i, j)},
then
f (i, j) = |S(i, j) \ [i, j]|,
i.e., f (i, j) is the number of elements of S(i, j) missing in [i, j]. Further since,
[i, j1 ] ⊂ [i, j2 ] and f (i, j1 ) > f (i, j2 ),
then there must be some j1 < j ′′ ≤ j2 such that
l(i, j1 ) < s[j ′′ ] < u(i, j1 ).
In other words, s[j ′′ ] lies within the minimum and maximum values of the
interval. Thus
l(i′ , j1 ) ≤ l(i, j1 ) < s[j ′′ ] < u(i, j1 ) ≤ u(i′ , j1 ),
and since
s[j ′′ ] ∉ [i′ , j1 ],

[i′ , j1 ] can never be an interval. This ends the proof of statement (F.1).
We leave the proof of statement (F.2) as Exercise 109 for the reader.

Figure 10.5 illustrates the facts of the lemmas for a simple example. Notice
that, for a fixed i (= 2), the function u(i, j) is nondecreasing and l(i, j) is
nonincreasing as j increases. The p-list of potent j’s is shown at the bottom.
We explain a few facts here.

1. j = 3 is potent since it is the largest j with
u(i, j) = 6 and l(i, j) = 5.

2. j = 5 is potent since it is the largest j with
u(i, j) = 8 and l(i, j) = 5.

3. By (F.1) of the lemma, there is no interval of the form [·, 4] since
(f (2, 4) = 1) > (f (2, 5) = 0).

[Figure 10.5(a): a plot of u(i, j) (top) and l(i, j) (bottom) against j for s = 4 6 5 8 7 3 9 2 with i = 2; the potent j's are marked as circles along the j-axis.]

j        3 4 5 6 7 8
s[j]     5 8 7 3 9 2
u(i, j)  6 8 8 8 9 9
l(i, j)  5 5 5 3 3 2
R(i, j)  1 3 3 5 6 7
r(i, j)  1 2 3 4 5 6
f (i, j) 0 1 0 1 1 1

(a) s = 4 6 5 8 7 3 9 2. (b) i = 2.

FIGURE 10.5: Illustration of Lemmas (10.5) and (10.6). (a) The input
string s is shown in the center. The figure shows the snapshot of the u(i, j)
and l(i, j) functions when index i = 2 pointing to 6 in s. As j goes from
3 to 8: (1) u(i, j) is the nondecreasing function shown on top (U.1 of the
lemma), (2) l(i, j) is the nonincreasing function shown at the bottom (L.1 of
the lemma), and (3) each of five potent indices (j = 3, 5, 6, 7, 8) of the p-list
are shown as little hollow circles in the bottom row. Only two of the potent
j’s, j = 3, 5 evaluate f (i, j) to 0. These are shown as dark circles. (b) The
tabulated values of the functions. The potent j’s are marked by arrows.

2. List of u(·, ·), l(·) functions. Consider the task of constructing the list
of u(·, ·) and l(·) functions:
For i = (n − 1), down to 1,
construct u(i, j) and l(i, j), for i < j ≤ n.
At iteration i, the functions u(i, j) and l(i, j) are evaluated (or constructed
for the algorithm). At i, a straightforward process (say like that of the algorithm of
Section 10.5.1) scans the string from n down to i, taking O(n − i) time to
compute u(·, ·) and l(·, ·). Since there are n − 1 iterations and

1 + 2 + 3 + . . . + (n − 1) = (n − 1)n/2,

this task takes O(n2 ) time for all the n − 1 iterations.
The RC algorithm performs the above task in only O(n) time. This is done
by a clever update at each iteration in the following manner.
1. The u(·, ·) and l(·, ·) function is stored as a list with the ability to add
and remove from one of the list, called the head of the list. This is also
called the Last In First Out (LIFO) order of accessing elements in a list.
The algorithm maintains a U list to store values of u(·, ·) and an L list
to store l(·, ·). However, only distinct elements are stored, along with
the largest index j that has the value. Thus, if

u(i, j−1) < u(i, j) = u(i, j+1) = . . . = u(i, j+l) < u(i, j+l+1),

for some l, then s[j + l] is stored (along with the index (j + l)) in the list.
By the same reasoning, s[j − 1] is stored (along with the index (j − 1))
and is the head of the list if i = j − 2.
For example consider the following segment of s and let i = 2.
j 2 3 4 5 6 7
s[j] 4 3 7 6 1 5
u(2, j) → 7→ 7 → 6 →5→ 5
U −→ 7 → 6 −→ 5
Note that U has only three elements. The head of the list points to
element 7 (with index j = 4).
j 2 3 4 5 6 7
s[j] 4 3 7 6 1 5
l(2, j) → 1 → 1 → 1 → 1 → 5
L −→ 1 → 5
Note that L has only two elements. The head of the list points to
element 1 (with index j = 6).
2. At each iteration, an element may be added to the list (U or L or
both), and zero, one or more consecutive elements may be removed in
order from the head of the list.

This follows from Lemmas (10.7) and (10.8): the first deals with the U
list and the second is an identical statement for the L list. The following
can be verified and we leave the proof of these lemmas as an exercise
for the reader.

LEMMA 10.7
For a fixed i, consider the two functions

(a) u(i, j) defined over i < j ≤ n, and


(b) u(i − 1, j) defined over (i − 1) < j ≤ n.

Then u(i − 1, j) is defined in terms of u(i, j) as follows:


If s[i − 1] < u(i, i + 1), then

u(i − 1, j) = u(i, i + 1) if j = i, and u(i − 1, j) = u(i, j) otherwise.

If s[i − 1] > u(i, i + 1), then

u(i − 1, j) = s[i − 1] if j = i or s[j] > u(i, j), and u(i − 1, j) = u(i, j) otherwise.

An identical result holds for the l(·, ·) function.

LEMMA 10.8
For a fixed i, consider the two functions

(a) l(i, j) defined over i < j ≤ n, and


(b) l(i − 1, j) defined over (i − 1) < j ≤ n.

Then l(i − 1, j) is defined in terms of l(i, j) as follows:


If s[i − 1] > l(i, i + 1), then

l(i − 1, j) = l(i, i + 1) if j = i, and l(i − 1, j) = l(i, j) otherwise.

If s[i − 1] < l(i, i + 1), then

l(i − 1, j) = s[i − 1] if j = i or s[j] > l(i, j), and l(i − 1, j) = l(i, j) otherwise.
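The essential point of these two lemmas is that, when the scan moves from i to i − 1, the new value s[i − 1] can only affect a prefix of entries at the head of the list, so each list element is inserted once and removed at most once. The Python sketch below illustrates this head-only maintenance under one natural reading (each distinct value of u(i, ·) is stored together with the largest index attaining it); it is not a transcription of Algorithm (8), and the function names are ours. The symmetric update for the L list swaps the comparison.

```python
def update_U(U, x, s_i, i):
    """One right-to-left step of the head-only maintenance of the U list.
    U is kept as a stack whose top U[-1] is the head (smallest j); an entry
    (value, j) records a distinct value of u(i, .) together with the largest
    index j attaining it.  Move the scan from i to i-1, with x = s[i-1]."""
    last = i                       # the new head entry will end at index 'last'
    while U and U[-1][0] < x:      # head values dominated by s[i-1] are removed
        last = U.pop()[1]
    val = max(x, s_i)              # u(i-1, i) = max(s[i-1], s[i])
    if not U or U[-1][0] != val:   # otherwise the old head's run simply extends
        U.append((val, last))

def u_list(s, stop_i):
    """The U list describing u(stop_i, .) after scanning i = n-1 down to stop_i."""
    n = len(s)
    U = [(s[-1], n)]               # initialisation: the single entry (s[n], n)
    for i in range(n, stop_i, -1):
        update_U(U, s[i - 2], s[i - 1], i)
    return list(reversed(U))       # report head first

# For s = 4 6 5 8 7 3 9 2 and i = 2 this prints [(6, 3), (8, 6), (9, 8)]:
# the distinct values 6, 8, 9 of u(2, j) in Figure 10.5(b), whose runs end
# at j = 3, 6 and 8 respectively.
print(u_list([4, 6, 5, 8, 7, 3, 9, 2], 2))
```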

3. p-list of potent indices. Note that the lists U and L are already
sorted by j. Merging the two lists, gives the p-list or the list of potent j’s.
For example,
j 2 3 4 5 6 7
s[j] 4 3 7 6 1 5
U −→ 7 → 6 −→ 5
L −→ 1 → 5
p −→ j = 4 → j = 5 → j = 6 → j = 7
p has four elements with the head pointing to index j = 4.
By Lemma (10.6), there is no interval of the form [i′ , j1 ] if

f (i, j1 ) > f (i, j2 ).

Hence, p-list can be pruned by removing j1 from the head of the list. We
make the following claim:

A p-list that is pruned only at the head of the list, possibly multiple
times, is such that for any two consecutive indices, j1 and j2 , in the
pruned list,
f (i, j1 ) ≤ f (i, j2 ).

This observation is crucial in asserting both the correctness and in the justi-
fication of the output-sensitive time complexity of the algorithm.

4. Correctness of the RC algorithm. The correctness of the algorithm


follows from Lemma (10.4) and (F.1) of Lemma (10.6). The former ensures
that if [i, j] is an interval then j must be potent w.r.t. i. Note that the converse
(i.e., if j is potent w.r.t. i, then [i, j] must be an interval) is not true and the
latter claim gives a way of trimming the potential indices. Simultaneously,
it also gives an efficient way of doing so by ensuring that the pruned p-list
is traversed only as long as the f (·) values on the consecutive indices are
nondecreasing.
Since for every other potent j (w.r.t. i), [i, j] is explicitly checked to see if
it is an interval, the algorithm does not miss any interval and the output is
correct.
In other words, every interval is of the form [i, j] where j is potent w.r.t
i (captured in the potent list) and the potent j’s that are not intervals are
pruned using (F.1). This also ensures that multiple intervals can be found
only as consecutive elements at the head of the list.

5. Time complexity of the RC algorithm.

1. An element is added at most once to the list (U or L list) and removed


exactly once from the list. Since the total number of elements is n and

only consecutive elements are removed (without having to traverse the


list in search of elements to be removed), all the n − 1 iterations take

O(n)

time, for each list.


2. The elements in the p-list may be accessed multiple times. However, as
discussed in the previous paragraphs, p-list is traversed from the head
of the list only to report intervals (without having to search the entire
list for intervals), thus it takes

O(NO )

time where NO is the number of intervals in the data.


Let n = NI , the size of the input. Thus the algorithm takes time

O(NI + NO ).

6. Concrete example. Figure 10.6 gives a complete concrete example.


When the input is a string s of length n, the index i scans the input from
n − 1 down to 1. At each scan i, the algorithm updates two lists (the U
and the L list) and processes for a potential j (shown as p-list) and emits all
intervals of the form [i, ·]. Notice that in Algorithm (7) this was an inner
loop (taking time O(n)) and the new procedure is outlined as Algorithm (8).
It begins by initializing two lists, U to store u(i, j) and L to store l(i, j) in
lines (0-1) and (0-2) respectively. The U list stores the triplet:

(i, j, u(i, j)),

and the L list stores the triplet:

(i, j, l(i, j)).

Both the lists are initialized to point to the triplet:

(n − 1, n, s[n]).

The initialization is shown in Figure 10.6(1). To avoid clutter, only the values
u(i, j) and l(i, j) are shown in the U and L lists respectively, and i and j are
shown separately to track the iterations.
The algorithm loops through lines (1) through (5). At each iteration or
position i, the algorithm maintains the upper bound information u(i, j) and
the lower bound information l(i, j) for each i < j ≤ n in the two lists U and L
respectively. The two lists store only distinct elements as shown in the figure.
Recall from Lemma (10.5) that for a fixed i, as j goes from (i + 1) to n,

[Figure 10.6, panels (1)-(6): the state of the U list, the L list, the p-list and the f values at scan positions i = 8, 7, 6, 5, 4, 3 on s = 1 4 6 5 8 7 3 9 2; the interval [5, 6] is emitted at panel (4) and [3, 4] at panel (6).]

FIGURE 10.6: The Uno-Yagiura RC Algorithm on s = 1 4 6 5 8 7 3 9 2.
The input is scanned right-to-left with the i index moving from 8 down to 1
as shown in each figure in (1)-(8). U is the u-list, L is the l-list and the
potent j’s are shown in the p-list. Continued in Figure 10.7.

[Figure 10.7, panels (7)-(9): the state of the lists at i = 2, at i = 1 and at the end of the scan; the intervals [2, 4], [2, 6], [2, 7], [2, 8] and [2, 9] are emitted at panel (7) and [1, 9] at panel (8).]

FIGURE 10.7: Continued from Figure 10.6. Each potent j is shown by
a circle: a hollow circle indicates a new potent j computed at that scan step
and a solid circle indicates an older potent j from earlier steps. The intervals
emitted at steps (4), (6), (7) and (8) are shown in the bottom row.

1. u(i, j) is a nondecreasing function (U.1 of the lemma) and

2. l(i, j) is a nonincreasing function (L.1 of the lemma).

Thus, by the monotonicity of the functions u(·, ·) and l(·, ·), a new element in the
list is only added at the head of the list. So this operation takes

O(1)

time. The pseudocode for this operation is given in Algorithm (9) as


InsertLList and InsertU List.
ScanpList(i, U Hd, LHd) is an important component of the algorithm that
‘scans the p-list’: this checks for intervals and outputs them. We assume that
the routine also maintains the p-list of potent indices. Note that in practice the p-list
need not be explicitly maintained as a separate list but can be obtained by
traversing the U and L lists. We describe the scanning process here through
the concrete example and the algorithmic translation of this description is
assigned as Exercise 111 for the reader.
In an attempt not to overwhelm the reader, we take up just a few scenarios
from the concrete example to underline the essence of the approach. However,
it is instructive to follow the example through in its entirety in Figures 10.6
and 10.7.
Scenario 1. When the scanning position is advanced from i to (i − 1), the
U and L lists are updated by inserting s[i − 1] into the lists as shown in
Figure 10.6, to maintain

1. U as a decreasing (to the left) list and

2. L as an increasing (to the left) list.

Thus if the new element s[i − 1] is added to the list, it can only be the head
of the list.
The list of potent j’s, the p-list in Figure 10.6 can be computed from the
U list and L list by traversing the two lists from the head and using the pair
(i, j ′ ) where j ′ is the largest j such that

u(i, j) = u(i, j ′ ) or l(i, j) = l(i, j ′ ).

For example, consider Figure 10.6(6). Here i = 3 and the five potent j’s
are:

1. j = 4 with u(3, j) = 6 and l(3, j) = 5,

2. j = 6 with u(3, j) = 8 and l(3, j) = 5,

3. j = 7 with u(3, j) = 8 and l(3, j) = 3,

4. j = 8 with u(3, j) = 9 and l(3, j) = 3 and

5. j = 9 with u(3, j) = 9 and l(3, j) = 2.



A potent j is fresh at iteration i, if it is computed during the iteration (scan)


i and is shown as a hollow circle in the figure. The j’s that are not fresh are
shown as solid circles.
Only the first potent j = 4 is fresh and the other four had already been
computed before and are shown as solid circles. The function f (i, j) is com-
puted for all the fresh potent j’s only.
If f (i, j) = 0, the interval [i, j] is emitted (as output). However,
if f (i, j1 ) > f (i, j2 ) for fresh potent j’s with j1 < j2 ,
then by (F.1) of Lemma (10.6), the element j1 is removed from the p-list, and
also from U and L lists, if it belonged in these lists.
Scenario 2. When we encounter the first potent j ′ along the p-list satisfying
1. j ′ is not fresh, and
2. f (i, j ′ ) ≠ 0,
then we stop the traversal of the list. This ensures that the list is being
traversed only when an output is being emitted (f (i, j) = 0).
Consider Figure 10.7(7). At i = 2, potent j = 4 evaluates f (i, j) = 0,
thus interval [2, 4] is emitted. Then the traversal of the p-list continues to
j = 6, 7, 8, 9 where in each case f (i, j) evaluates to 0, thus further emitting
the intervals
[2, 6], [2, 7], [2, 8], [2, 9].
Thus the p-list is traversed only as long as an output is being emitted.
Scenario 3. Consider Figure 10.7(8). Here i = 1. The fresh potent j’s are
2, 4, 7 and 9. First
f (1, 2), f (1, 4)
are each evaluated to be 2. Then f (1, 7) is computed to be 1, hence both
j = 4 and then j = 2 are removed (by (F.1) of Lemma (10.6)): thus head of
U list points to 8 and head of L list continues to point to 1.
Next f (1, 9) is evaluated to be 0, so the potent j = 7 is removed from
the p-list and the head of U list is now 9 (head of L list continues to be
1). The interval [1, 9] is emitted and the U , L and p lists are as shown in
Figure 10.7(9).
This concludes the description of the concrete example.

10.6 Intervals to PQ Trees


Here we discuss how to encode the intervals as PQ trees in time linear
in the size of the interval. First we identify a special set of intervals called
irreducible and then present the algorithm which uses this to give a very
efficient algorithm.

10.6.1 Irreducible intervals


(p ∈ P) is reducible if there exist (p1 ≠ p), (p2 ≠ p) ∈ P such that
p1 ∩ p2 ≠ φ, and
p1 ∪ p2 = p.
A pattern (p ∈ P) that is not reducible is irreducible. An interval [i, j] is
reducible if there exists
i < j1 ≤ j2 < j such that [i, j2 ] and [j1 , j]
are intervals. An interval that is not reducible is called irreducible.
Recall that patterns and intervals are two different representations of the
same entity.

Example 4 Let s = 3 5 2 4 7 6 8 1. Then


irreducible permutation irreducible
patterns intervals
p0 = {1, 2, 3, 4, 5, 6, 7, 8} I0 = [1, 8]
p1 = {2, 3, 4, 5, 6, 7, 8} I1 = [1, 7]
p2 = {2, 3, 4, 5} I2 = [1, 4]
p3 = {6, 7, 8} I3 = [5, 7]
p4 = {6, 7} I4 = [5, 6]
Note that in the example, p1 = p2 ∪ p3 , but p2 ∩ p3 = φ, so this particular decomposition does not by itself make p1 reducible.

Example 5 Let s = 3 5 2 4 6 7 8 1. Then


irreducible permutation irreducible
patterns intervals
p0 = {1, 2, 3, 4, 5, 6, 7, 8} I0 = [1, 8]
p1 = {2, 3, 4, 5, 6, 7, 8} I1 = [1, 7]
p2 = {2, 3, 4, 5} I2 = [1, 4]
p3 = {6, 7} I3 = [5, 6]
p4 = {7, 8} I4 = [6, 7]
Note that
p = {6, 7, 8} (interval [5, 7])
is not irreducible since
p = p3 ∪ p4 with p3 ∩ p4 = {7}.
In other words, interval [5, 7] is reducible since [5, 6] and [6, 7] are intervals.
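Reusing the naive O(n²) extractor of Section 10.5.1, the following Python sketch (with illustrative function names, not code from the text) classifies each interval of a permutation as reducible or irreducible directly from this definition.

```python
def all_intervals(s):
    """All [i, j] (1-indexed) with max - min = j - i, by the naive O(n^2) scan."""
    n, out = len(s), []
    for i in range(1, n):
        u = l = s[i - 1]
        for j in range(i + 1, n + 1):
            u, l = max(u, s[j - 1]), min(l, s[j - 1])
            if u - l == j - i:
                out.append((i, j))
    return out

def irreducible_intervals(s):
    """An interval [i, j] is reducible if [i, j2] and [j1, j] are intervals for
    some i < j1 <= j2 < j; the intervals that are not reducible are kept."""
    I = set(all_intervals(s))
    def reducible(i, j):
        return any((i, j2) in I and (j1, j) in I
                   for j1 in range(i + 1, j) for j2 in range(j1, j))
    return [iv for iv in sorted(I) if not reducible(*iv)]

# For the identity permutation the irreducible intervals are exactly the
# n - 1 adjacent pairs {i, i+1}; here it prints [(1, 2), (2, 3), (3, 4), (4, 5)].
print(irreducible_intervals([1, 2, 3, 4, 5]))
```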
Our next step is to design an algorithm that extracts the irreducible inter-
vals. This algorithm is based on the Uno-Yaguira RC algorithm. We begin
by identifying some special j’s in the p-list:
j^i_min is the minimum j such that f (i, j) = 0 and
j^i_max is the maximum j such that f (i, j) = 0.

[Figure 10.8: the p-list for i = 1 over j = 2, . . . , 14; f (1, j) = 0 at j = 3, 5, 7, 9, 10, 11, 13 and f (1, j) > 0 at j = 14.]

FIGURE 10.8: Here i = 1 and the three irreducible intervals are shown
by arrows at j = 3, j = 9 and j = 13 representing intervals [1, 3], [1, 9] and
[1, 13] respectively. Intervals [1, 5], [1, 7], [1, 10] and [1, 11] are not irreducible,
however f (1, j) = 0, for j = 5, 7, 10, 11. While scanning the p-list for irreducible
intervals of the form [1, ·], the j’s for which f (i, j) is actually evaluated are
j = 3, 9, 13, 14. The scanning terminates when f (i, j) > 0 (here at j = 14).

LEMMA 10.9
Let
1 ≤ i < j1 < j < j2 ≤ n.
If
[i, j1 ] and [j1 , j2 ] are intervals,
then
[i, j]
is not an irreducible interval.

PROOF We first show that [j1 , j] is an interval: the proof of this statement
is not very difficult and left as an exercise for the reader (Exercise 115). Next,
the interval [i, j] cannot be irreducible, since there are two other intervals
[i, j1 ] and [j1 , j2 ] that overlap and their union is [i, j].

This simple observation helps design a very efficient algorithm to detect


only the irreducible intervals. We explain this through an example. This is
also termed the ScanpListirreducible(·) operation in Algorithm (10).
Consider Figure 10.8. The list of potent j’s is

(3, 5, 7, 9, 10, 11, 13, 14)

marked by solid circles for i = 1. Scanning the p-list:

1. The head of the list, j = 3, evaluates f (1, 3) to 0.

[1, j=3] is an irreducible interval since it is the smallest interval with
i = 1.

j^3_min = 5 and j^3_max = 7, shown by the curved segments in the figure,
which had been computed in the previous steps.

2. The scanning of the p-list now jumps to the element following j^3_max = 7,
which in this example is 9.
f (1, 9) evaluates to 0.
Again j^9_min = 10 and j^9_max = 11, which had been computed before.
3. The scanning of the p-list now jumps to the element following j^9_max = 11,
which here is 13.
f (1, 13) evaluates to 0, but there are no intervals of the form [13, ·].
4. So, the scanning continues to the next element on the list, 14.
f (1, 14) evaluates to a nonzero value and the scanning stops.
Next, j^1_min is updated to 3 and j^1_max is updated to 13, for subsequent iterations.

Algorithm 10 The Irreducible Intervals Extraction


CreateList(n-1,n,s[n],LHd,NIL)
CreateList(n-1,n,s[n],UHd,NIL)
=⇒ j^n_min ← n, j^n_max ← n

FOR i = n − 1 DOWNTO 1 DO
InsertLList(i,i+1,s[i],LHd)
InsertUList(i,i+1,s[i],UHd)
=⇒ ScanpListirreducible(i,UHd,LHd)
=⇒ Update j^i_min, j^i_max
ENDFOR
To summarize, the RC intervals algorithm can be modified to compute the
irreducible intervals and this is shown as Algorithm (10). The lines marked
with right arrows on the left are the new statements introduced here.
The last paragraph summarized the ScanpListirreducible(·) routine, and the
workings of the other routines are straightforward and are left as an exercise
for the reader.
A complete example of computing irreducible intervals on s = 3 2 4 6 5 7 8 1 9
is shown in Figure 10.9.
Correctness of algorithm (10). The algorithm emits the same intervals
as the RC algorithm except the ones suppressed by the scan jumps. This is
straightforward to see from Lemma (10.9).
We first establish a connection between irreducible intervals and PQ trees.
We postpone the analysis of the complexity of the algorithm to after this
discussion.

10.6.2 Encoding intervals as a PQ tree


Consider a PQ tree T whose leaf nodes are labeled bijectively by the integers
1, 2, . . . n. For a node v let
I(v) = {i | the leaf node labeled by i is reachable from node v}.

[Figure 10.9, panels (1)-(4): the scan at i = 6, 5, 4, 3 on s = 3 2 4 6 5 7 8 1 9, with the (lower, upper) bound pair shown above each j. The interval [6, 7] is emitted at panel (1), [4, 5] and [4, 7] at panel (3), and [3, 5] and [3, 7] at panel (4); the recorded values are j^6_min = j^6_max = 7 at (1), j^4_min = 5, j^4_max = 7 at (3), and j^3_min = 5, j^3_max = 7 at (4).]

FIGURE 10.9: The Heber-Stoye Algorithm on s = 3 2 4 6 5 7 8 1 9. The
input is scanned right-to-left with index i and the upper and lower bound
corresponding to each j is shown in round brackets. The irreducible intervals
and the j^i_min, j^i_max values at steps (3), (4) and (6) are also shown. See Fig-
ure 10.10 for continuation (for (5) and (6)) of the example and text for further
details.

[Figure 10.10, panels (5)-(6): the scan at i = 2 and i = 1; no interval is emitted at panel (5), while the intervals [1, 2], [1, 3], [1, 8] and [1, 9] are emitted at panel (6), with j^1_min = 2 and j^1_max = 9.]

FIGURE 10.10: Continuation of the example of Figure 10.9.

1. For each P node v, the interval is given by I(v).


2. Let v be a Q node with l children written as the sequence

v1 v2 . . . vl .

Then for 1 ≤ j1 ≤ j2 ≤ l,

I(vj1 ,j2 ) = I(vj1 ) ∪ I(vj1 +1 ) ∪ . . . ∪ I(vj2 ).

The set of intervals denoted by the Q node v are

I(v) = {I(vj1 ,j2 ) | 1 ≤ j1 ≤ j2 ≤ l}.

Let V1 be the set of P nodes and V2 the set of Q nodes in T . Then the set of
intervals encoded by this PQ tree T is:
I(T ) = ( ⋃v∈V1 {I(v)} ) ∪ ( ⋃v∈V2 I(v) )          (10.5)
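To make Equation (10.5) concrete, here is a small Python sketch (the tuple representation of P and Q nodes is ours, purely for illustration) that enumerates the interval sets contributed by each internal node of a PQ tree.

```python
def leaves(v):
    """The set I(v) of leaf labels reachable from node v."""
    if isinstance(v, int):
        return {v}
    _, children = v
    out = set()
    for c in children:
        out |= leaves(c)
    return out

def encoded_intervals(T):
    """All intervals encoded by the PQ tree T, per Equation (10.5): each P node
    contributes I(v); each Q node contributes the union of every run of
    consecutive children."""
    result = set()
    def walk(v):
        if isinstance(v, int):
            return
        kind, children = v
        if kind == 'P':
            result.add(frozenset(leaves(v)))
        else:  # Q node
            for a in range(len(children)):
                for b in range(a, len(children)):
                    acc = set()
                    for c in children[a:b + 1]:
                        acc |= leaves(c)
                    result.add(frozenset(acc))
        for c in children:
            walk(c)
    walk(T)
    return result

# A small tree: a Q node over the leaves 1, 2, 3 nested inside a P node with
# leaf 4.  The Q node yields {1,2}, {2,3}, {1,2,3} (and the singletons);
# the P node yields {1,2,3,4}.
T = ('P', [('Q', [1, 2, 3]), 4])
print(sorted(map(sorted, encoded_intervals(T))))
```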

In this section we show that the collection of irreducible intervals can be


organized as a PQ tree. Recall that any two irreducible intervals,

I1 = [k11 , k12 ], I2 = [k21 , k22 ],

satisfy one of the following:


1. (contained or nested) k21 ≤ k11 < k12 ≤ k22 , without loss of generality,
or,
2. (overlap) k12 = k21 , or,

3. (disjoint) k11 < k12 < k21 < k22 , without loss of generality.

Each leaf node of the PQ tree is labeled with an integer 1 ≤ i ≤ n. The


overlapping irreducible intervals can be arranged in the left-to-right order,
since they overlap by a single element: this can be represented by a Q node
whose children are ordered. If a Q node has l children, denoted as

j1 , j2 , . . . , jl−1 , jl ,

then the irreducible intervals are

[j1 , j2 ], [j2 , j3 ], . . ., [jl−1 , jl ].

The contained intervals can be represented by P nodes. Thus each P node


corresponds to an irreducible interval I and Π(I) is the set of leaf nodes reach-
able by this P node. Since [1, n] is always an interval, we get a single (con-
nected) PQ tree.
We next explore the construction of the PQ tree from the irreducible in-
tervals. In fact, the irreducible interval algorithm (Algorithm (10)) can be
modified to also produce the PQ tree representation of the interval, [1, n].
The process involves constructing the PQ tree bottom-up. In the process
more than one PQ tree may be under construction. At each stage, the root
of each of the PQ tree under construction maintains information about the
interval it represents, through pointers

u-ptr and l-ptr,

to the sequence being scanned. Note that it is adequate to maintain the


pointers only of the roots. So when a node becomes the child of another
node, the pointers become redundant and are removed.
This is best explained through an example illustrated in Figure 10.11 on

s = 8 9 1 4 6 3 5 2 7.

As i moves down from 9, the first irreducible intervals,

[4, 7], [4, 8] and [4, 9]

are emitted at i = 4, which are shown as solid rectangles in Figure 10.11(1-3).


Consider the irreducible intervals in increasing order of their sizes:

[4, 7], [4, 8] and [4, 9].

1. First [4, 7] is processed (Figure 10.11(1)). j = 7 has no pointers, s[7] is


collected as a child node. Further,

j = 6 down to j = 4

have no pointers, so
s[6], s[5], s[4]
are collected as children. These 4 children are assembled together as a
P node as shown in (1).
j = 7 maintains a unidirectional pointer, u-ptr, to this P node and
j = 4 maintains a bidirectional pointer, l-ptr. Both are shown as dashed
curves in the figure.
In other words the interval spanned by the P node is captured through
the u-ptr and the l-ptr. The u-ptr of j = 7 and the l-ptr of j = 4 are
updated to point to the constructed P node as shown in (1).
2. Next consider irreducible interval [4, 8] (Figure 10.11(2)). j = 8 has no
pointers, so s[8] is collected as a child node. But j = 7 has a u-ptr
pointing to the P node which points to j = 4 via the bidirectional l-ptr.
Hence the P node and s[8] are collected as children.
As there are only two elements a Q node is constructed with these two
as children. This is shown in (2). The u-ptr of j = 8 and the l-ptr of
j = 4 are updated to point to the constructed Q node as shown in (2).
3. Similarly irreducible interval
[4, 9]
is processed and is shown in Figure 10.11(3).
Next at j = 1, two irreducible intervals [1, 2] and [1, 9] are emitted. They
are also considered in the increasing order of their sizes.
1. First [1, 2] is processed.
j = 2 has no pointers and clearly j = 1 has no pointers either, so s[2]
and s[1] are collected as children. Since there are only two children, a
Q node is constructed with these two as children as shown in (4).
The u-ptr of j = 2 and the l-ptr of j = 1 are updated to point to the
freshly constructed Q node.
2. Next [1, 9] is processed.
j = 9 has a u-ptr to a Q node that points to j = 4 via the l-ptr. So the
Q node is assembled as a child.
The next considered is j = 3 (to the immediate left of the l-ptr of the
Q node). This has no pointers, s[3] is assembled as a child.
Next j = 2 (immediate left of j = 3) is considered. This has a u-ptr
pointing to a Q node, whose l-ptr points to 1. Thus this Q node is
assembled as a child and the scanning stops.
Since there are three children a P node is constructed with these three
children as shown in (5).
This completes the example. Figure 10.13 describes another example. Here
we illustrate a case when j = 2 at Figure 10.13(4) has an l-ptr (but no u-
ptr). In this case s[2] will be collected as a sibling, not a child, as shown in
Figure 10.14(5).

Algorithm 11 The PQ Tree Construction


constructPQTree(i, k, J[])
// J[] is a k-dim array; the irreducible
// intervals are [i, J[1]], [i, J[2]], . . . , [i, J[k]]
FOR l = 1 TO k DO
  Tmp ← φ, j ← J[l], sblng ← FALSE
  WHILE j ≠ i DO
    IF cell j's u-ptr ≠ NIL
      Place the node N that the u-ptr points to in Tmp
      Let jt be the cell pointed to by the l-ptr of N
      j ← jt − 1
      [Remove pointers of N]   // now they are redundant
    ELSEIF cell j's l-ptr ≠ NIL
      Let the Q node it points to be NQ; sblng ← TRUE
      j ← j − 1   // cell j must point to a Q node
    ELSE   // all pointers are NIL
      Create a leaf node s[j] and add it to Tmp
      j ← j − 1
  ENDWHILE
  IF |Tmp| > 2 create a P node T
    The nodes in Tmp are made the children of T
  ELSE create a Q node T
    The nodes in Tmp are made the ordered children of T
  IF sblng = TRUE make T the leftmost child of NQ
ENDFOR
The correctness of the algorithm follows from the following lemma.

LEMMA 10.10
At every iteration, j has no more than 1 pointer. The pointer is either a u-ptr
or an l-ptr.

PROOF Assume cell j has two u-ptrs, then there are two irreducible
intervals of the form [·, j]. Clearly one is contained in the other, hence must be
a child (or descendent) of the other. By the step shown as a boxed statement
in the pseudocode of Algorithm (11), the pointers of the child (or descendent)
have been removed, leading to a contradiction. Similarly j can not have
multiple l-ptrs.
Next assume that cell j has a u-ptr and an l-ptr. Then they must have a
parent Q node and by the boxed statement of the algorithm, the children’s
pointers are removed, leading to a contradiction.

[Figure 10.11, panels (1)-(3): processing of the irreducible intervals [4, 7], [4, 8] and [4, 9] at i = 4 on s = 8 9 1 4 6 3 5 2 7; each panel shows the partially built PQ tree and the u-ptr/l-ptr pointers into s.]

FIGURE 10.11: The PQ Tree Algorithm on s = 8 9 1 4 6 3 5 2 7. See


Figure 10.12 for continuation of this example and text for further details.

[Figure 10.12, panels (4)-(6): processing of the irreducible intervals [1, 2] and [1, 9] at i = 1, and the final PQ tree for s = 8 9 1 4 6 3 5 2 7.]

FIGURE 10.12: Continuation of the example of Figure 10.11.



[Figure 10.13, panels (1)-(4): processing of the irreducible intervals [4, 7] and [4, 10] at i = 4, [2, 3] at i = 2 and [1, 2] at i = 1 on s = 7 8 9 3 5 2 4 0 6 1, showing the partially built PQ trees and the pointers.]

FIGURE 10.13: The PQ Tree Algorithm on s = 7 8 9 3 5 2 4 0 6 1. See


Figure 10.14 for the continuation of the example and see text for further
details.

[Figure 10.14, panel (5): processing of the irreducible interval [1, 10] at i = 1 and the resulting PQ tree.]

FIGURE 10.14: Continuation of the example of Figure 10.13.

10.6.2.1 Time complexity of algorithms (10) and (11)


What is an irreducible interval good for? Firstly, the other (reducible) per-
mutation patterns can be constructed from irreducible patterns and secondly,
there are only a small number of irreducible patterns. See Section 13.4 for an
exposition on boolean closure. We leave the proof of the following as Exer-
cise 116 for the reader.

THEOREM 10.2
(Irreducible intervals theorem) Consider s, a permutation of integers
1, 2, . . . , n. Let I be the set of all intervals on s and let M be the set of all
irreducible intervals on s. Then the following statements hold.

1. M is the smallest set such that

I = B(M ).

2. M is unique. In other words there does not exist M ′ (≠ M ) with |M ′ | ≤ |M |,
such that I = B(M ′ ).

3. The size of M is bounded by n,5 i.e.,

|M | < n.

5 This was first proved by Heber and Stoye.



Is the bound of n on the size of M tight? We go back to the example at


the beginning of Section 10.5 (where we showed that |I| = O(n2 )):

s = 1 2 3 4 . . . n.

Then
M = {{1, 2}, {2, 3}, . . . , {n − 1, n}}.
Thus |M | = (n − 1) and the bound is tight.
A maximal permutation pattern is relevant in the context of multiply ap-
pearing characters or patterns that appear only in a subset (not necessarily
all) of the collection of sequences. Now it is easy to see that both the algo-
rithms take O(n) time. It is clear that Algorithm (10) is linear in the size of
the output. Since the number of irreducible intervals is no more than n, the
algorithm takes O(n) time.
It is easy to see in Algorithm (11) that each cell is scanned once. The
number of internal nodes is bounded by n. Thus the algorithm takes O(n)
time.

10.7 Applications
Genes that appear together consistently across genomes are believed to be
functionally related: these genes in each others’ neighborhood often code for
proteins that interact with one another suggesting a common functional asso-
ciation. However, the order of the genes in the chromosomes may not be the
same. In other words, a group of genes appear in different permutations in
the genomes [MPN+ 99, OFD+ 99, SLBH00]. For example in plants, the ma-
jority of snoRNA genes are organized in polycistrons and transcribed as poly-
cistronic precursor snoRNAs [BCL+ 01]. Also, the olfactory receptor (OR)
gene superfamily is the largest in the mammalian genome. Several of the
human OR genes appear in clusters with ten or more members located on al-
most all human chromosomes and some chromosomes contain more than one
cluster [GBM+ 01].
As the available number of complete genome sequences of organisms grows,
it becomes a fertile ground for investigation along the direction of detecting
gene clusters by comparative analysis of the genomes. A gene g is compared
with its orthologs g ′ in the different organism genomes. Even phylogenetically
close species are not immune from gene shuffling, such as in Haemophilus
influenzae and Escherichia Coli [WMIG97, SMA+ 97]. Also, a multicistronic
gene cluster sometimes results from horizontal transfer between species [LR96]
and multiple genes in a bacterial operon fuse into a single gene encoding multi-
domain protein in eukaryotic genomes [MPN+ 99].

If the function of a gene cluster, say g1 g2 , is known, the function of the
corresponding ortholog cluster g2′ g1′ may be predicted. Such positional correlation of
genes as clusters and their corresponding orthologs have been used to predict
functions of ABC transporters [TK98] and other membrane proteins [KK00].
Domains are portions of the coding gene (or the translated amino acid se-
quences) that correspond to a functional subunit of the protein. Often, these
are detectable by conserved nucleic acid sequences or amino acid sequences.
The conservation helps in a relative easy detection by automatic motif discov-
ery tools. However, the domains may appear in a different order in the distinct
genes giving rise to distinct proteins. But, they are functionally related due to
the common domains. Thus these represent functionally coupled genes such
as forming operon structures for co-expression [TCOV97, DSHB98].
Next we present two case studies: these were carried out mainly by Oren
Weimann and discussed in [LPW05].

10.7.1 Case study I: Human and rat


In order to build a PQ tree for human and rat whole genome compar-
isons the output of a program called SLAM [ACP03] was used: SLAM is
a comparative-based annotation and alignment tool for syntenic genomic se-
quences that performs gene finding and alignment simultaneously and predicts
in both sequences symmetrically. When comparing two sequences, SLAM
works as follows: Orthologous regions from the two genomes as specified by a
homology map are used as input, and for each gene prediction made in the hu-
man genome there is a corresponding gene prediction in the rat genome with
identical exon structure. The results from SLAM of comparing human (NCBI
Build 31, November 2002) and rat (RGSC v2, November 2002) genomes,
sorted by human chromosomes, have been used in the following analysis. The
data in every chromosome is presented as a table containing columns: Gene
name, rat coords, human coords, rat coding length, human coding length and
# Exons.
There were 25,422 genes predicted by SLAM, each gene appears exactly
once in each of the genomes. Each one of the 25,422 genes is mapped to an
integer, thus, the human genome becomes the identity permutation

1, 2, 3, . . ., 25422,

and the rat genome becomes a permutation of

1, 2, 3, . . ., 25422

obtained from the SLAM output table. The full mapping can be found in:
http://crilx2.hevra.haifa.ac.il/~orenw/MappingTable.ps.
Ignoring the trivial permutation pattern involving all the genes, there are
only 504 interesting maximal ones out of 1,574,312 permutation patterns in
this data set. In Figure 10.15 a subtree of the Human-Rat whole genome PQ

[Figure 10.15: a PQ tree whose leaves are the gene labels 1997 through 2125, grouped into the runs 1997-2017, 2018-2025, 2026-2040, 2041-2043, 2044-2122 and 2123-2125 that appear in the rat permutation given below.]
FIGURE 10.15: A subtree of the common maximal permutation pattern
PQ tree of human and rat orthologous genes.

tree is presented. This tree corresponds to a section of 129 genes in human


chromosome 1 and in rat chromosome 13. By the mapping, these genes appear
in the human genome as the permutation:

(1997 − 2125)

and in the rat genome as the permutation:

(2043 − 2041, 2025 − 2018, 2123 − 2125, 2122 − 2044, 2040 − 2026, 2017 − 1997).

Another subtree of the Human-Rat whole genome PQ tree, corresponding


to a section of 156 genes in human chromosome 17 and in rat chromosome 10
is
((21028 − 21061) − (21019 − 21027) − (21018 − 20906)).
The neighboring genes PMP22 and TEKTIN3 (corresponding to 21014 and
21015) are functionally related genes as explained in [BMRY04].
Figure 10.16 shows a few more common gene clusters of human and rat.

10.7.2 Case study II: E. Coli K-12 and B. Subtilis


Here a PQ tree obtained from a pairwise comparison between the genomes
of E. Coli K-12 and B. Subtilis is discussed. The input data is from NCBI
GenBank, in the form of the order of COGs (Clusters Of Orthologous Groups)
and their location in each genome.
The data can be found in http://euler.slu.edu/~goldwasser/cogteams/data
as part of an experiment discussed by He and Goldwasser in [HG04], whose
goal was to find COG teams. They extracted all clusters of genes appearing in
both sequences, such that two genes are considered neighboring if the distance
between their starting position on the chromosome (in bps) is smaller than a
chosen parameter δ > 0. One of their experimental results, for δ = 1900 was
the detection of a cluster of only two genes: COG0718, whose product is an
uncharacterized protein conserved in bacteria, and COG0353, whose product

Human chromosome 1:  A B C D E F G H I J
Rat chromosome 13:   J I H G D B F E C A
A: 1988-2013, B: 2014-2021, C: 2022-2036, D: 2037-2039, E: 2040-2118,
F: 2119-2121, G: 2122-2128, H: 2129-2130, I: 2131-2141, J: 2142-2153.
(1) 66 genes cluster.

Human chromosome 9:  A 55 56 57 58 59 60 61 62 C
Rat chromosome 5:    A 57 59 55 60 56 62 58 61 C
A: 12745-12754, B: 12755-12762, C: 12763-12791.
(2) 47 genes cluster.

Human chromosome 10: A B C D E F
Rat chromosome 17:   E C A F B D
A: 13544-13553, B: 13554-13556, C: 13557-13562, D: 13563, E: 13564-13573,
F: 13574.
(3) 31 genes cluster.
FIGURE 10.16: Examples of common gene clusters of human and rat. See
text for details.

is a recombinational DNA repair protein. They conjecture that the function


of COG0353 might give some clues as to the function of COG0718 (which is
undetermined).
Here PQ trees of clusters of genes appearing in both sequences are built.
Two genes are considered neighboring if they are consecutive in the input data
irrespective of the distance between them. There are 450 maximal permuta-
tion patterns out of 15,000 permutation patterns.

DNA repair genes. Here we mention a particularly interesting cluster:

(COG2812 − COG0718 − COG0353).

The product of COG2812 is DNA polymerase III, which according to [BHM87]


is also related to DNA repair. The PQ tree clearly shows that COG0718,
whose function is undetermined is located between two genes whose function is
related to DNA repair. This observation further contributes to the conjecture
that the function of COG0718 might be also related to DNA repair. Note that
the reason that COG2812 was not clustered with COG0718 and COG0353
in [HG04] is because the distance between COG2812 and COG0718 is 1984
(> δ = 1900).

10.8 Conclusion
Although permutation patterns have been studied more recently than sub-
string patterns, their usefulness should not be underestimated. The notion of
maximality in this new context is particularly interesting since it provides a
purely combinatorial way of cutting down on the output size without com-
promising on any information content. We end the chapter by reiterating the
dramatic reduction in the output size simply by the use of maximality on two
biological data sets.

                              Number of      Number of
                              all patterns   maximal patterns

human & rat                   1,574,312      504

E. Coli K-12 & B. Subtilis    15,000         450

More sophisticated models such as permutation patterns with fixed gaps


[Par07b] and patterns with gaps of bounded size are being studied with appli-
cations to phylogenetic studies [KPL06, Par06] and other interesting prob-
lems [LKWP05]. Again, the burning question continues to be:

How significant is the discovered permutation pattern?


We address this in the next chapter.

10.9 Exercises
Exercise 99 (Maximality) Prove that if p2 is nonmaximal with respect to
p1 , then p1 ⊂ p2 . Is the converse true? Why?

Exercise 100 (Multiplicity) Is it possible that


|Lp1 | ≠ |Lp2 |,
when p2 is nonmaximal with respect to p1 ?
Hint: (1) Let quorum K = 2 and s = a b c d e b c a . . . . . . a b c d e. Then con-
sider
p1 = {d, e} and p2 = {a, b, c, d, e}.
Is p1 nonmaximal with respect to p2 ? Observe that p1 occurs only two times
but p2 occurs three times.
(2) Let K = 2 and s = a b c d b a c . . . . . . a b c a b c d . . . . . . a b c d a b c.
Then consider
p1 = {a, b, c} and p2 = {a(2), b(2), c(2), d}.
How many times does p2 occur? p1 occurs two times each in the first and
third occurrence of p2 , and, four times in the second occurrence of p2 . Is p1
nonmaximal with respect to p2 ?

Exercise 101 Consider the permutation patterns shown in Figure 10.1. Which
of these are maximal? Give the PQ tree representation of the maximal pat-
terns.
Hint: p = a-b-c-d-e occurs at locations 1, 6 and 11 on the input s. Can every
other pattern be deduced from p?

Exercise 102 (Multiple PQ trees) Let p ∈ P>1 be maximal and be defined


as
p = {σ1 (c1 ), σ2 (c2 ), . . . , σl (cl )},
occurring in K ′ locations. Assuming the elements of p do not occur else-
where in the input, how many PQ trees may be required to represent all the
nonmaximal patterns? Discuss.

Hint: Does the following PQ tree capture all the nonmaximal patterns of p
given by Equation (10.4) in Section 10.3.2?

x c

d e a b

Note that only one leaf node is labeled with c, although the multiplicity of c
is 2 in p. In the following example, a single PQ tree cannot represent the two
nonmaximal patterns {a, b, c, d} and {c, d, e, f } in the two occurrences:

o1 = c b d a g e f d c ,

o2 = d g c a b c d e f .

However, in general is the problem well-defined? Note that a universal PQ


tree, i.e., a PQ tree with a single P node as the root node, represents all
possible permutations of the elements of p.
Also, in general, a PQ tree represents more ‘occurrences’ than it actually
encodes for. Can this ‘excess’ be minimized? This leads to the specification
of the Minimal Consensus PQ Tree Problem.

Exercise 103 Recall p of Equation (10.4):

p = {a, b, c(2), d, e, x}.

p has exactly three occurrences on the input given as

o1 = d e a b c x c,
o2 = c d e a b x c,
o3 = c x c b a e d.

Enumerate all the nonmaximal patterns with respect to p, assuming they do


not occur elsewhere in the input.

Exercise 104 Can the running time of the Parikh Mapping-based algorithm
discussed in this chapter be improved?

Hint: Is it possible to reduce factor (log t)2 to log t in the time complexity?
Recall that the tags are assigned in increasing order. Let at stage j, tj be the
largest assigned integer to a tag. If the newly encountered tag (t′1 , t′2 ) is such

that, t′1 , t′2 ≤ tj , then the first of the tag pair can be stored in an array and
directly accessed in O(1) time reducing one of the O(log t) factors to O(1).
This can be made possible if all the entries in the Ψ array are known in
advance. This can be simply done by a linear scan of the input with the L-sized
window and recording all the distinct numbers in Ψ that are generated by the
L-sized window. Let the largest number encountered be t∗0 . Note that t∗0 ≤ L,
by the choice of the window size. Then the tag values are assigned starting
with t∗0 , thus every new number tnew encountered is such that tnew ≤ t∗0 .

Exercise 105 (Algorithm generalization) In some applications, the mul-


tiplicity of the permutation pattern is not used, thus a permutation pattern

p = {a, b, c}

could have an occurrence o given

o = a b b c a.

In such cases, only


Π(o) = {a, b, c}
is of consequence. Under this condition Ψ is a binary vector and the only
possible pairs at level 0 of the naming tree are

(0, 0), (0, 1), (1, 0), (1, 1).

How is the Parikh Mapping-based algorithm improved based on this simplifying


assumption?

Exercise 106 What is the maximum number of maximal permutation pat-


terns in a sequence of length n with quorum k = 3? Give arguments to
support your claim.

Exercise 107 1. Refer to the definitions in Section 10.5.1. Is it possible


that
l(i, j) = u(i, j)
for some j > i? Why?

2. Given s and a fixed i, show that if j(> i) is not potent with respect to
i, then [i′ , j] is not an interval for all i′ ≤ i.
Hint: (1) Are all elements of s distinct? (2) See the proof of Lemma (10.4).

Exercise 108 (Monotone functions) Prove statements (U.1) and (L.1) of Lemma (10.5).

• (U.1) For 1 ≤ i < n, u(i, j) is a nondecreasing function over j = (i + 1)


up to n.

• (L.1) For 1 ≤ i < n, l(i, j) is a nonincreasing function over j = (i + 1)


up to n.

Hint: Use proof by contradiction.

Exercise 109 Give arguments to show that statement (F.2) of Lemma (10.6)
is equivalent to the following statement: If

u(i−1, j1 ) = u(i, j1 ) and l(i−1, j1 ) = l(i, j1 ) and


u(i−1, j2 ) = u(i, j2 ) and l(i−1, j2 ) = l(i, j2 ),

for
1 < i < j1 < j2 ≤ n,
then
f (i − 1, j1 ) − f (i − 1, j2 ) = f (i, j1 ) − f (i, j2 ).
Prove the above statement or (F.2).

Exercise 110 Prove the following statements.

• (U.2) Let 1 < i < j1 < j2 ≤ n. If

u(i, j1 ) < u(i, j2 ), and


u(i − 1, j1 ) = u(i − 1, j2 ),

then, f (i′ , j1 ) > 0 for 1 ≤ i′ < j1 .

• (L.2) Let 1 < i < j1 < j2 ≤ n. If

l(i, j1 ) > l(i, j2 ), and


l(i − 1, j1 ) = l(i − 1, j2 ),

then, f (i′ , j1 ) > 0 for 1 ≤ i′ < j1 .

Hint: Use proof by contradiction.

Exercise 111 Refer to the algorithm discussion in Section 10.5.2.



1. Enumerate the steps involved in moving from i = 2, shown in Fig-


ure 10.5, to i = 1 shown here.

[Plot: u(i, j) and l(i, j) against j for s = 4 6 5 8 7 3 9 2 at i = 1.]

j         2  3  4  5  6  7  8
s[j]      6  5  8  7  3  9  2
u(i, j)   6  6  8  8  8  9  9
l(i, j)   4  4  4  4  3  3  2
R(i, j)   2  2  4  4  5  6  7
r(i, j)   1  2  3  4  5  6  7
f(i, j)   1  0  1  0  0  0  0

(a) s = 4 6 5 8 7 3 9 2.   (b) i = 1.

2. Give a pseudocode for the algorithm of Section 10.5.2 assuming p-list is


stored explicitly.

3. Give a pseudocode for the algorithm of Section 10.5.2 assuming p-list is


computed on-the-fly from the U and L lists.

Exercise 112 Give a pseudocode description, along the lines of the subrou-
tines in Algorithm (9), of the following three routines:

1. deleteU List(i, j, v, Hd),

2. deleteLList(i, j, v, Hd), and,

3. ScanpList(i, j, v, Hd).

Exercise 113 Let s of length n be such that each element is distinct and

Π(s) ⊂ {1, 2, . . . , N }

and 1, N ∈ Π(s), for some N > n. Does Algorithm (8) work for this input s?

Exercise 114 (Monge array) An m × n matrix M , is said to be a Monge


array if, for all i′ , i, j, j ′ such that

1 ≤ i′ < i ≤ m, and 1 ≤ j < j ′ ≤ n,

the following holds (called the Monge property):

M [i′ , j] + M [i, j ′ ] ≥ M [i, j] + M [i′ , j ′ ].



1. For the upper-diagonal matrices below, does the Monge Property hold (whenever the matrix elements are defined)? Rows correspond to i = 1, . . . , 9 and columns to j = i, . . . , 9.

(a) Mu:
6 6 7 7 9 9 9 9 9
  4 7 7 9 9 9 9 9
    7 7 9 9 9 9 9
      2 9 9 9 9 9
        9 9 9 9 9
          1 8 8 8
            8 8 8
              3 5
                5

(b) Ml:
6 4 4 2 2 1 1 1 1
  4 4 2 2 1 1 1 1
    7 2 2 1 1 1 1
      2 2 1 1 1 1
        9 1 1 1 1
          1 1 1 1
            8 3 3
              3 3
                5

(c) Mf:
0 1 1 2 3 3 2 1 0
  0 2 3 4 4 3 2 1
    0 4 5 5 4 3 2
      0 6 6 5 4 3
        0 7 6 5 4
          0 6 5 4
            0 4 3
              0 1
                0

2. For
1 ≤ i′ < i < j < j ′ ≤ n
show that
f (i′ , j) + f (i, j ′ ) ≥ f (i, j) + f (i′ , j ′ ).

Hint: 1. A row i in Mu is u(i, j), in Ml is l(i, j), and in Mf is f (i, j), for
j ≥ i for some sequence s.
2. Show that

u(i′ , j) + u(i, j ′ ) ≥ u(i, j) + u(i′ , j ′ ) and


l(i′ , j) + l(i, j ′ ) ≤ l(i, j) + l(i′ , j ′ ).
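The inequalities in the hint can be checked mechanically; the sketch below (Python, not from the text) computes u, l and f = u − l − (j − i) (the definition of f assumed from Section 10.5.1) for a sequence of distinct integers and verifies the three Monge-type relations.

```python
# A small sketch (not from the text) checking the hint of Exercise 114:
# rows of Mu, Ml, Mf are u(i,j), l(i,j) and f(i,j) = u(i,j) - l(i,j) - (j - i)
# (the definition assumed from Section 10.5.1).

from itertools import combinations

def u_l_f(s):
    n = len(s)
    u = [[None] * (n + 1) for _ in range(n + 1)]
    l = [[None] * (n + 1) for _ in range(n + 1)]
    f = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            window = s[i - 1:j]
            u[i][j] = max(window)
            l[i][j] = min(window)
            f[i][j] = u[i][j] - l[i][j] - (j - i)
    return u, l, f

def monge_like_holds(M, n, sense):
    """Check M[i'][j] + M[i][j'] {>=, <=} M[i][j] + M[i'][j'] for all i' < i < j < j'."""
    for ip, i, j, jp in combinations(range(1, n + 1), 4):
        lhs, rhs = M[ip][j] + M[i][jp], M[i][j] + M[ip][jp]
        if sense == ">=" and lhs < rhs:
            return False
        if sense == "<=" and lhs > rhs:
            return False
    return True

if __name__ == "__main__":
    s = [6, 4, 7, 2, 9, 1, 8, 3, 5]
    u, l, f = u_l_f(s)
    n = len(s)
    print("u satisfies the >= inequality:", monge_like_holds(u, n, ">="))
    print("l satisfies the <= inequality:", monge_like_holds(l, n, "<="))
    print("f satisfies part 2 (>=):     ", monge_like_holds(f, n, ">="))
```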

Exercise 115 Let [i, j1 ] and [j1 , j2 ] be intervals. If

[i, j], j1 < j < j2 ,

is an interval, then show that [j1 , j] is an interval.

Hint: Enumerate the cases possible and explore each case.

Exercise 116 (Irreducible intervals) Let I be the set of all intervals on


some s of length n. Let M be the smallest set such that

I = B(M ).

Then show the following statements hold.



1. M is the set of irreducible intervals, and

2. M is unique, and

3. |M | < n.

Hint: (1) Use the reduced partial order graph G(I, Er ).


(2) Use proof by contradiction.
(3) Use the PQ tree structure to prove the result. Consider a PQ tree T that
uniquely captures the given permutations. Each Q node with k children can
be replaced by a stack of (k − 1) nodes (shown as solid nodes below in (c))
to give a transformed PQ tree T ′ . An irreducible pattern corresponds to a
node in this transformed tree. The number of internal nodes is bounded by
the number of leaf nodes in a tree.
In the following figure, (a) shows the PQ tree capturing Example (4); each internal node corresponds to an irreducible pattern. (b) shows the PQ tree capturing Example (5), and (c) shows the transformed PQ tree of (b), where each dark node encodes an irreducible pattern.

[Figure: (a) the PQ tree T capturing Example (4), for s2 = 3 5 2 4 7 6 8 1; (b) the PQ tree capturing Example (5), for s3 = 3 5 2 4 6 7 8 1; (c) the transformed PQ tree T′ of (b), with each dark node encoding an irreducible pattern.]

Exercise 117 Consider Algorithm (10). If the scanning of the input is switched
to left-to-right (instead of right-to-left as in the current description), does the
algorithm emit the same irreducible intervals? Why?

Exercise 118 (Compomers) [B0̈4] Consider a left-to-right ordered DNA


sequence
s = 0 A C C G T T 1
where symbols 0 and 1 denote the leftmost and rightmost end respectively of
the sequence. If one or more C is removed from s, the following fragments
arise:
f1 = 0 A , f2 = C G T T 1, f3 = 0 A C, f4 = G T T 1.

Here

Π′ (f1 ) = {0, A},


Π′ (f2 ) = {C, G, T (2), 1},
Π′ (f3 ) = {0, A, C},
Π′ (f4 ) = {G, T (2), 1}.

Similarly, one or more A, G and T can be cleaved giving rise to more frag-
ments.
Assume an assay DNA technology (MALDI-TOF mass spectrometry [B0̈4]),
that reads only Π′ (f )(also called a compomer) for each fragment f . In this
example, the complete collection of compomers is as follows:
1. (cleaved by C): {0, A}, {C, G, T (2), 1}, {0, A, C}, {G, T (2), 1},
2. (cleaved by A): {0}, {C(2), G, T (2), 1},
3. (cleaved by G): {0, A, C(2)},{T (2), 1},
4. (cleaved by T ): {0, A, C(2), G}, {T, 1}, {0, A, C(2), T, G}, {1}
Is it possible to reconstruct the original s from this collection of compomers?
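A hedged sketch of how such a collection can be generated mechanically is given below (Python; the function names and the string encoding of s are illustrative assumptions, and this is not the method of the cited work).

```python
# A small sketch (not from the cited work) that generates compomers of the kind
# used in Exercise 118: every way of deleting one or more copies of a chosen
# base from s yields fragments, and each fragment is reported only as its
# multiset of symbols (a compomer).

from itertools import product
from collections import Counter

def compomers(s, base):
    """All compomers obtained by removing a nonempty subset of the copies of `base`."""
    positions = [i for i, ch in enumerate(s) if ch == base]
    result = set()
    for mask in product([False, True], repeat=len(positions)):
        if not any(mask):
            continue                      # at least one copy must be cleaved
        cut = {p for p, removed in zip(positions, mask) if removed}
        fragment, fragments = [], []
        for i, ch in enumerate(s):
            if i in cut:
                if fragment:
                    fragments.append(fragment)
                fragment = []
            else:
                fragment.append(ch)
        if fragment:
            fragments.append(fragment)
        for frag in fragments:
            result.add(frozenset(Counter(frag).items()))
    return result

if __name__ == "__main__":
    s = "0ACCGTT1"                        # 0 and 1 mark the two ends
    for base in "CAGT":
        print(base, [dict(sorted(c)) for c in compomers(s, base)])
```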

Exercise 119 (Local alignment of genomes) [OFG00] The local align-


ment of nucleic or amino acid sequences, called the multiple sequence align-
ment problem, is based on similar subsequences; however the local alignment
of genomes is based on detecting locally conserved gene clusters. For example
the chunk of genes
g1 g2 g3
may be aligned with
g3′ g1′ g2′ .
Such an alignment is never detected in subsequence alignments.
Give a formal definition of the local alignment of genomes problem and
discuss a method to solve it.
Hint: Define a measure of gene similarity (to identify gene orthologs) to
define the alignment problem.

Exercise 120 (Common connected components) Consider the follow-


ing application: In metabolic pathways, different dependencies might exist among the enzymes, metabolites, proteins and so on, while the players may still be the same. In other words, different organisms or tissue cultures may give evidence of different relations. Thus ‘permutation patterns’ on metabolic pathway networks yield collections of enzymes and proteins that could possibly participate in different pathways.

1. The task is to formulate the problem that can address the above.
2. Consider the following two examples. What are the common permuta-
tions (clusters) in both the graphs in (a) and (b)?
[Figure: (a) a pair of graphs on the vertex labels A–J; (b) a pair of graphs on the vertex labels A–I; the edge structure of the original figure is not reproduced here.]

Hint: 1. This is a problem of finding permutations (or intervals) on graphs.


The problem can be formalized as follows:
Common Connected Component Problem: Given n graphs

G(V, Ei ), 1 ≤ i ≤ n,

and a quorum K > 1 and a size m, the problem is to find all the maximal

V ′ ⊆ V with |V ′ | ≥ m,

such that for at least k ≥ K graphs,

G(V, Ei1 ), G(V, Ei2 ), . . . , G(V, Eik ),

each induced graph G(V′, E′ij), with E′ij ⊆ Eij and 1 ≤ j ≤ k, is connected.
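A brute-force sketch of the membership test behind this formulation is given below (Python, not from the text; all names are illustrative): it checks whether a candidate V′ induces a connected subgraph in at least K of the given edge sets.

```python
# A brute-force sketch (not from the text) of the test underlying the
# Common Connected Component Problem: does a candidate vertex set V' induce a
# connected subgraph in at least K of the given graphs G(V, E_i)?

def is_connected(vertices, edges):
    """Connectivity of the subgraph induced by `vertices` (undirected view)."""
    vertices = set(vertices)
    if not vertices:
        return False
    adj = {v: set() for v in vertices}
    for a, b in edges:
        if a in vertices and b in vertices:
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == vertices

def quorum(v_prime, edge_sets, K):
    """True if V' is connected in at least K of the edge sets."""
    return sum(is_connected(v_prime, E) for E in edge_sets) >= K

if __name__ == "__main__":
    E1 = {("A", "B"), ("B", "C"), ("C", "D")}
    E2 = {("A", "C"), ("C", "B"), ("A", "D")}
    print(quorum({"A", "B", "C"}, [E1, E2], K=2))   # True: connected in both
```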

To date, the following best addresses this problem.

Partitive families and decomposition trees [MCM81, CR04]: A family


of subsets of a finite set V is partitive if and only if
a. it does not contain the empty set, but contains the trivial subsets of V
(singletons and V itself), and
b. the union, intersection, and differences of any pair of overlapping subsets are in the family (two subsets of the same set overlap if they intersect but neither is a subset of the other).

Although partitive families can be quite large (even exponentially large), they
have a compact, recursive representation in the form of a tree, where the leaves
are the singletons of V , namely, the decomposition tree:

THEOREM 10.3
(Decomposition theorem) [MCM81] There are exactly three classes of
internal nodes in a decomposition tree of a partitive family.
a. A Prime node is such that none of its children belongs to the family,
except for the node itself.
b. A Degenerate node is such that every union of its children belongs to
the family.
c. A Linear node is given with an ordering on its children such that a union
of children belongs to the family if and only if they are consecutive in
the ordering.

2. The solutions to the two examples are shown below encoded as an


annotated PQ tree. The P node has the same interpretation as in the regular
PQ tree and the Q node is actually a graph as shown below. The set of
leafnodes reachable from any subgraph of a graph of the Q node and a hollow
P node is a solution to the problem.

[Figure: the annotated PQ trees encoding the solutions to the two examples, (a) and (b); the internal graph structure of the Q nodes is not reproduced here.]

Comments
I particularly like this chapter since it is a nice example of the marriage
of elegant theory and useful practice. Usually, the idea of maximality of
patterns is very important and in the case of permutation patterns, it is also
nonobvious. Further, it beautifully fits in with the PQ tree data structure and
Parikh mapping, both having been well studied independently in literature.
Chapter 11
Permutation Pattern Probabilities

Probability is like gravity,


you can’t fight it.
- Sonny in Miami Vice

11.1 Introduction
Just as it is reasonable to compute the odds of seeing a string pattern in
a random sequence, so is the case with permutation patterns. We categorize
permutation patterns as (1) unstructured and (2) structured.
The former usually refers to the case where these patterns (or clusters) are
observed in sequences, usually defined on fairly large alphabet sets.
The structured permutations refer to PQ trees, that is, the encapsulation of
the common internal structure across all the occurrences of the permutation
pattern. The question here is regarding the odds of seeing this structure (as
a PQ tree) in a random sequence.

11.2 Unstructured Permutations


Consider the following problem: What is the chance of seeing an n-mer
consisting of

i1 number of A’s,

i2 number of C’s,

i3 number of G’s and

i4 number of T’s,

with i1 + i2 + i3 + i4 = n,

in a random strand of DNA?


This is a pattern p where the order does not matter (also called a permu-
tation pattern in Chapter 10) and is written as

p = {A(i1 ), C(i2 ), G(i3 ), T (i4 )}.

We make the simplifying assumption that the multiple occurrences of p do


not overlap.
We construct the discrete probability space (see Section 3.2.1 for the nota-
tion used here)
(Ω, 2^Ω, MP)
as follows.1 Let Ω be the set of all possible n-mers on {A, C, G, T}. Let

ωi1 ,i2 ,i3 ,i4

be an n-mer with
i1 number of A’s,
i2 number of C’s,
i3 number of G’s and
i4 number of T’s
and let pX be the probability of occurrence of X where

X = A, C, G or T

with
pA + pC + pG + pT = 1.
Then the probability measure function

MP : Ω → R≥0 ,

is defined as follows:

MP(ω_{i1,i2,i3,i4}) = [(i1 + i2 + i3 + i4)! / (i1! i2! i3! i4!)] (pA)^{i1} (pC)^{i2} (pG)^{i3} (pT)^{i4}
                   = [n! / (i1! i2! i3! i4!)] (pA)^{i1} (pC)^{i2} (pG)^{i3} (pT)^{i4}.

In particular, if the four nucleotides A, C, G, T are equiprobable, then the formula simplifies to

MP(ω_{i1,i2,i3,i4}) = [n! / (i1! i2! i3! i4!)] (1/4^n).

1 See Chapter 3 for the definitions of the terms.
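Evaluating this measure is straightforward; the sketch below (Python, not from the text) computes MP for a given composition and symbol probabilities.

```python
# A direct evaluation (a sketch, not from the text) of the probability measure
# above: the multinomial coefficient times the per-symbol probabilities.

from math import factorial

def composition_probability(counts, probs):
    """M_P for an n-mer with the given symbol counts and symbol probabilities."""
    n = sum(counts)
    coeff = factorial(n)
    for c in counts:
        coeff //= factorial(c)
    p = 1.0
    for c, q in zip(counts, probs):
        p *= q ** c
    return coeff * p

if __name__ == "__main__":
    # 3-mers over {A, C, G, T} with one A and two C's, equiprobable nucleotides:
    print(composition_probability([1, 2, 0, 0], [0.25] * 4))   # 3 * (1/4)^3 = 0.046875
```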



How do we get this formula? And, does it satisfy the probability mass condi-
tions (see Section 3.2.4)?
To address these curiosities, we pose the following general question where
we use m instead of 4.
What is the number of distinct strings where each has exactly i1 number
of x1 ’s, i2 number of x2 ’s, . . ., im number of xm ’s?
This is not a very difficult computation, but we also need to show its relation
to a probability mass function. Hence we take a ‘multinomial’ view of the
problem: It turns out that this number is precisely the multinomial coefficient
in combinatorics. This is one of the easiest ways of computing this number
and we study that in the next section. The summary of the discussion is as
follows.
1. MP(ω_{i1,i2,i3,i4}) is computed from the multinomial coefficient (divided by m^n), and
2. P (Ω) = 1 follows from Equation (11.3).
If
i1 + i2 + . . . + im = n,
then each string is of length n. As an example, let

m = 2, n = 3 and i1 = 1.

Then i2 = 2 and there are only three distinct strings given as

ω1 = x1 x2 x2 ,
ω2 = x2 x1 x2 and
ω3 = x2 x2 x1 .

11.2.1 Multinomial coefficients


We briefly digress here to recall the multinomial formula which is the ex-
pansion of
(x1 + x2 + · · · + xm)^n.

Let Ψ be an m-dimensional array of nonnegative integers such that

Σ_{i=1}^{m} Ψ[i] = n.

Let Sig(m, n) be the set of all possible signatures for the given m and n. For
instance,
Sig(2, 2) = {[2, 0], [1, 1], [0, 2]}.

2 This is also called the Parikh vector and is discussed in Chapter 10.

Further, let Ψ[index] be denoted by iindex . Thus

if Ψ = [1, 0], then i1 = 1 and i2 = 0.

Relating the terms to the discrete probability space, there is an injective


mapping

I : Ω → Sig(m, n),

where

I(ωi1 ,i2 ,...,im ) = (Ψ = [i1 , i2 , . . . im ]).

For m > 0 and n ≥ 0, the following can be verified (with some patience):

 
(x1 + x2 + · · · + xm)^n = Σ_{Ψ∈Sig(m,n)} [n! / (i1! i2! . . . im!)] x1^{i1} x2^{i2} . . . xm^{im}.   (11.1)

The number

n! / (i1! i2! . . . im!),   (11.2)

corresponding to each Ψ, is called the multinomial coefficient. Let this number be denoted by MC(Ψ). See the following three cases as illustrative examples:


m = 2 and n = 2:   (x1 + x2)^2 = x1^2 + 2 x1 x2 + x2^2

Ψ        MC(Ψ)   strings
[2, 0]   1       x1 x1
[1, 1]   2       x1 x2, x2 x1
[0, 2]   1       x2 x2

m = 2 and n = 3:   (x1 + x2)^3 = x1^3 + 3 x1^2 x2 + 3 x1 x2^2 + x2^3

Ψ        MC(Ψ)   strings
[3, 0]   1       x1 x1 x1
[2, 1]   3       x1 x1 x2, x1 x2 x1, x2 x1 x1
[1, 2]   3       x2 x2 x1, x2 x1 x2, x1 x2 x2
[0, 3]   1       x2 x2 x2

m = 3 and n = 2:   (x1 + x2 + x3)^2 = x1^2 + x2^2 + x3^2 + 2 x1 x2 + 2 x2 x3 + 2 x1 x3

Ψ          MC(Ψ)   strings
[2, 0, 0]  1       x1 x1
[0, 2, 0]  1       x2 x2
[0, 0, 2]  1       x3 x3
[1, 1, 0]  2       x1 x2, x2 x1
[0, 1, 1]  2       x2 x3, x3 x2
[1, 0, 1]  2       x1 x3, x3 x1
It follows from Equation (11.1), by setting x1 = x2 = · · · = xm = 1, that for a given m and n, the sum of all the multinomial coefficients is m^n. In other words,

m^n = Σ_{Ψ∈Sig(m,n)} n! / (i1! i2! . . . im!) = Σ_{Ψ∈Sig(m,n)} MC(Ψ).

Thus,

Σ_{Ψ∈Sig(m,n)} MC(Ψ) / m^n = 1.   (11.3)

Note that the total number of n-mers is also m^n.
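Equation (11.3) is easy to confirm computationally; the sketch below (Python, not from the text) enumerates Sig(m, n) and checks that the multinomial coefficients sum to m^n.

```python
# A quick check (a sketch, not from the text) of Equation (11.3): summing the
# multinomial coefficients over all signatures in Sig(m, n) gives exactly m^n.

from math import factorial

def signatures(m, n):
    """All m-dimensional nonnegative integer vectors that sum to n."""
    if m == 1:
        yield [n]
        return
    for first in range(n + 1):
        for rest in signatures(m - 1, n - first):
            yield [first] + rest

def multinomial(psi):
    coeff = factorial(sum(psi))
    for c in psi:
        coeff //= factorial(c)
    return coeff

if __name__ == "__main__":
    for m, n in [(2, 2), (2, 3), (3, 2), (4, 5)]:
        total = sum(multinomial(psi) for psi in signatures(m, n))
        print(m, n, total, total == m ** n)
```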



11.2.2 Patterns with multiplicities


We pose two problems related to the one in the last section. Let the alpha-
bet be
Σ = {σ1 , σ2 , . . . , σl , . . . , σm }.

Problem 15 (Permutations with exact multiplicities) Let q be an n-


mer generated by a stationary, i.i.d. source which emits σ ∈ Σ with probability
pσ . What is the probability that q has exactly i1 number of σ1 ’s, i2 number of
σ2 ’s, . . ., il number of σl ’s?

Problem 16 (Permutations with inexact multiplicities) Let q be an n-


mer generated by a stationary, iid source which emits σ ∈ Σ with probability
pσ . What is the probability that q has at least i1 number of σ1 ’s, i2 number of
σ2 ’s, . . ., il number of σl ’s?
In both scenarios,

pσ1 + pσ2 + . . . + pσl + . . . + pσm = 1,

Further, let
k = i1 + i2 + . . . + il ≤ n.

Permutations with exact multiplicities. In the first problem, the char-


acters σ1 , σ2 , . . . , σl
1. occur in some k positions on the n-mer with σ1 occurring i1 times, σ2
i2 times, . . ., σl occurs il times, and
2. do not occur on the remaining n − k positions.
Using Equation (11.2), the number of such distinct occurrences is given by

(n choose k) · [k! / (i1! i2! . . . il!)].   (11.4)

For each distinct occurrence, the probability of its occurrence is given as:

(pσ1)^{i1} (pσ2)^{i2} . . . (pσl)^{il} ((1 − pσ1)(1 − pσ2) . . . (1 − pσl))^{n−k}.   (11.5)

For a specific choice, denoted as j, of k out of n locations on the string,


let Ej denote the event that σ1 , σ2 , . . . , σl appear only in these k locations
satisfying the stated constraints (i.e., exactly i1 number of σ1 ’s and so on).
Then it can be verified that

Ej1 ∩ Ej2 = ∅, (11.6)


Permutation Pattern Probabilities 329

i.e., the events are disjoint for any pair j1 ≠ j2. The proof of this statement
is left as Exercise 121 for the reader.
Next, using Equations (11.4) and (11.5), the answer to the first problem,
denoted as P_{i1+i2+...+il}, is given as

P_{i1+i2+...+il} = (n choose i1+i2+. . .+il) · [(i1 + i2 + . . . + il)! / (i1! i2! . . . il!)]
                   · (pσ1)^{i1} (pσ2)^{i2} . . . (pσl)^{il} ((1 − pσ1)(1 − pσ2) . . . (1 − pσl))^{n−k}.
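The displayed expression can be evaluated directly; the sketch below (Python, not from the text; the numerical values are illustrative) multiplies the placement count of Equation (11.4) by the probability term of Equation (11.5).

```python
# A direct evaluation (a sketch, not from the text) of the displayed expression
# for P_{i1+i2+...+il}: the placement count of Equation (11.4) times the
# per-occurrence probability term of Equation (11.5).

from math import comb, factorial

def exact_multiplicity_probability(n, counts, probs):
    """counts[t], probs[t] refer to sigma_1 ... sigma_l; n is the string length."""
    k = sum(counts)
    placements = comb(n, k) * factorial(k)
    for c in counts:
        placements //= factorial(c)
    p = 1.0
    for c, q in zip(counts, probs):
        p *= q ** c
    residual = 1.0
    for q in probs:
        residual *= (1.0 - q)
    return placements * p * residual ** (n - k)

if __name__ == "__main__":
    # a 6-mer with exactly two sigma_1's and one sigma_2, where
    # p_{sigma_1} = 0.1 and p_{sigma_2} = 0.2 (illustrative values)
    print(exact_multiplicity_probability(6, [2, 1], [0.1, 0.2]))   # about 0.0448
```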

Permutations with inexact multiplicities. The second problem is a lit-


tle more complex. We first define a set of l-tuples as follows:

C = { (i′1, i′2, . . . , i′l) :  i′1 + i′2 + . . . + i′l = k′ ≤ n,  i1 ≤ i′1,  i2 ≤ i′2,  . . . ,  il ≤ i′l }.

For
j = (i′1 , i′2 , . . . , i′l ) ∈ C,
let Ej denote the event that σ1 , σ2 , . . . , σl occur exactly i′1 , i′2 , . . . , i′l times
respectively. Then
Ej1 ∩ Ej2 = ∅, (11.7)
i.e., the events are disjoint for any pair j1 ≠ j2 (∈ C). We leave the proof of
this as an exercise for the reader (Exercise 121).
Since the events are disjoint, the answer to the second problem, denoted as P′_{i1+i2+...+il}, is obtained using the solution to the first problem:

P′_{i1+i2+...+il} = Σ_{(i′1,i′2,...,i′l)∈C} P_{i′1+i′2+...+i′l}.   (11.8)

The reader is also directed to [DS03, HSD05] for results on real data and
generalizations to gapped permutation patterns.

11.3 Structured Permutations


The last section dealt with cases where an element of the alphabet occurs
significantly many times in the input. But consider a scenario where an ele-
ment of the alphabet occurs only a few times but the size of the alphabet is
fairly large.

We have also seen in Chapter 10 that a permutation pattern can be hier-


archically structured as a PQ tree. Thus it is meaningful to ask: Given a
permutation pattern q, where

q = {σ1 , σ2 , . . . , σl },

that occurs k times in the input, what is the p-value, pr(T, k), of its maximal
form given as a PQ tree T ?
What does it mean to compute this probability? We give an exposition
based on explicit counting below.

11.3.1 P -arrangement

An arrangement of size k is defined to be a string (or permutation) of some


k consecutive integers

i, i + 1, i + 2, . . . , i + k − 1,

and its inversion is obtained by reading the elements from right to left. For
example, q1 and q2 shown below are arrangements of sizes 5 and 3 respectively.

q1 = 5 2 4 3 1 and its inversion is 1 3 4 2 5,


q2 = 4 5 6, and its inversion is 6 5 4.

Recall the following notation from Section 2.8.2:

Π(q1 [1..5]) = {1, 2, 3, 4, 5},


Π(q2 [1..3]) = {4, 5, 6}.

Let q be an arrangement of size k. Recall from Section 10.5 that

[k1 ..k2 ], 1 ≤ k1 < k2 ≤ k,

is an interval in q if for some integers i < j, the following holds:

Π(q[k1 ..k2 ]) = {i, i + 1, i + 2, . . . , j}.

Note that
[1..k]

is always an interval, hence is called the trivial interval. Every other interval
is nontrivial. See the examples below for illustration.
q1 = 5 2 4 3 1:
interval [k1..k2]   Π(q1[k1..k2])     size
[3..4]              {3, 4}            2
[2..4]              {2, 3, 4}         3
[1..4]              {2, 3, 4, 5}      4
[1..5]              {1, 2, 3, 4, 5}   5

q2 = 4 5 6:
interval [k1..k2]   Π(q2[k1..k2])     size
[1..2]              {4, 5}            2
[2..3]              {5, 6}            2
[1..3]              {4, 5, 6}         3

An arrangement of size k is a P -arrangement if it has no nontrivial intervals.


For example, q1 and q2 are not P -arrangements but q3 and q4 are where
q3 = 1 2, q4 = 2 4 1 3.
Now, we are ready to state the central problem of the section.

Problem 17 (P -arrangement) What is the number of P -arrangements of


size k?
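Before the counting machinery is developed, a brute-force baseline is useful; the sketch below (Python, not from the text) tests every permutation of 1, . . . , k for nontrivial intervals and reproduces the small values Pa(2) = 2, Pa(3) = 0 and Pa(4) = 2 used later.

```python
# A brute-force baseline (a sketch, not from the text): count P-arrangements of
# size k by testing every permutation of 1..k for a nontrivial interval.

from itertools import permutations

def has_nontrivial_interval(q):
    k = len(q)
    for k1 in range(k):
        for k2 in range(k1 + 1, k):
            if k2 - k1 + 1 == k:
                continue                      # [1..k] is the trivial interval
            window = q[k1:k2 + 1]
            if max(window) - min(window) == k2 - k1:
                return True                   # the values form consecutive integers
    return False

def pa_brute_force(k):
    return sum(1 for q in permutations(range(1, k + 1))
               if not has_nontrivial_interval(q))

if __name__ == "__main__":
    for k in range(2, 7):
        print(k, pa_brute_force(k))           # 2, 0, 2, 6, 46
```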

11.3.2 An incremental method


How does one count such arrangements? Does this have a closed form
formula? We give an exposition below that will set the stage for computing
this number using an incremental method.
We first clarify the term position in an arrangement and the interpretation
of an empty symbol, φ, in the arrangement. The following example best
explains the ideas. For example consider the following:
position   1 2 3 4 5 6 7
q =        4 2 φ 1 6 5 3

The nontrivial intervals defined by [2..4] and [5..6] in q are

4 [2 φ 1] 6 5 3   and   4 2 φ 1 [6 5] 3,

with Π(q[2..4]) = {1, 2} and Π(q[5..6]) = {5, 6}.

Base cases. Note that for k = 1, the problem is not defined since the size
of an interval is at least two. For k = 2, the P -arrangements are as follows:

12

and its inversion


2 1.
For k = 3, the number of P -arrangements is zero, since every arrangement of
the three numbers has at least one nontrivial interval as shown below.

1 [3 2],   [2 1] 3,   [3 2] 1.

For k = 4, the P -arrangements are:

3142

and its inversion


2 4 1 3.

Principle 1. Can we obtain a P -arrangement of size 5 using

q4 = 3 1 4 2,

the P -arrangement of size 4? If element 5 is inserted at the start end of q4 as

5 3 1 4 2,

then we have the nontrivial interval as shown below

5 [3 1 4 2].

Similarly, adding element 5 at the other end will give a nontrivial interval.
Also, if element 5 is inserted next to 4 as shown below

3 1 5 4 2,

we get the nontrivial interval


3 1 [5 4] 2.
Similarly, element 5 inserted to the right of 4 will again give a nontrivial
interval. However, the following has no nontrivial intervals

q5 = 3 5 1 4 2.

Thus we can derive a general principle of construction of a P -arrangement of


size k + 1 from a P -arrangement of size k which is stated as follows.

LEMMA 11.1
Let q be a P -arrangement of
1, 2, . . . , k.

Let q′ be constructed from q by inserting element k + 1 at any of the
k − 3 positions in q that is not an end position and not adjacent to element
k. Then q ′ is a P -arrangement of 1, 2, . . . , k, k + 1.

Does the converse of Lemma (11.1) hold? In other words, is it true that
removing element k + 1 from any P -arrangement of

1, 2, . . . , k, k + 1

gives a P -arrangement of 1, 2, . . . , k? Consider the following P -arrangement


of size 5:

4 2 5 1 3.

However, deleting element 5 gives the following, with a nontrivial interval shown bracketed:

4 [2 1] 3.
In other words, this P -arrangement of size 5 could not have been constructed
incrementally from one of size 4 and the incremental construction will miss
such P -arrangements.

Principle 2∗∗. We take a closer look at this arrangement of 1, 2, 3, 4; the two nontrivial intervals are as shown below:

4 [2 1] 3   and   4 [2 1 3].

It turns out that the smallest nontrivial interval is nested in the others, i.e.,
this interval is a subset of the others. An interval [i11 . . . i12 ] is a subset of
[i21 . . . i22 ], written as
[i11 . . . i12 ] ⊂ [i21 . . . i22 ]
if and only if the following holds:

i21 ≤ i11 < i12 ≤ i22 .

In fact, in this example, all the intervals are nested, since otherwise the ar-
rangement of size 5 will not be a P -arrangement. This observation can be
generalized as the following lemma.

LEMMA 11.2
Let q be a P -arrangement of

1, 2, . . . , k, k + 1.

Let q ′ be obtained from q by removing the element k + 1 from q. If q ′ is not a


P -arrangement, then the smallest nontrivial interval is nested in every other
nontrivial interval.

For example, consider the following P -arrangement


q = 9 1 3 6 4 11 7 5 8 2 10
Removing the element 11 gives the intervals of 9 1 3 6 4 φ 7 5 8 2 10, which can be arranged as multiple (2 in this example) sequences of nested intervals (written here by their element sets):

{4, . . . , 7} ⊂ {4, . . . , 8} ⊂ {3, . . . , 8} ⊂ {2, . . . , 8} ⊂ {1, . . . , 8} ⊂ {1, . . . , 9} ⊂ {1, . . . , 10},
{4, . . . , 7} ⊂ {3, . . . , 7} ⊂ {3, . . . , 8} ⊂ {2, . . . , 8} ⊂ {1, . . . , 8} ⊂ {1, . . . , 9} ⊂ {1, . . . , 10}.

The sizes of the intervals in q are: 4, 5, 6, 7, 8, 9, 10, and there are two
intervals of size 5. We observe the following about nested intervals in an
arrangement.

LEMMA 11.3
Let q be an arrangement such that it has r nested nontrivial intervals
[i11 ..i12 ] ⊂ [i21 ..i22 ] ⊂ .. ⊂ [ir1 ..ir2 ].
Then for each j = 1, 2, . . . , r,
q[ij1 ..ij2 ]
is a P -arrangement of size ij2 − ij1 + 1.

This gives a handle on designing a method for counting (as well as enumer-
ating) arrangements with nested nontrivial intervals. Consider the following
arrangement of size 10 with nested intervals as shown:

8 10 6 3 1 4 2 7 5 9

Note that the smallest interval is a P -arrangement, and when the smallest
interval is replaced by its extreme (largest here) element, shown in bold below,

8 10 6 4 7 5 9

the next smallest interval is again a P -arrangement. We call this process


of replacing an interval by its extreme element as telescoping. The next tele-
scoping is shown below.
8 10 7 9
Two more examples of successive telescoping is shown below. Notice that at
each stage the smallest interval is a P -arrangement.

6 8 [[4 3] 1 5 2] 7 → 6 8 [4 1 5 2] 7 → 6 8 5 7

3 1 [4 [6 5]] 2 → 3 1 [4 5] 2 → 3 1 4 2

Successive telescoping for the previous example, where an interval is used


instead of a single extreme, is shown below:
First sequence:
9 1 3 6 4 7 5 8 2 10 → 9 1 3 4-7 8 2 10 → 9 1 3-7 8 2 10 → 9 1 3-8 2 10 → 9 1 2-8 10 → 9 1-8 10 → 1-9 10 → 1-10.

Second sequence:
9 1 3 6 4 7 5 8 2 10 → 9 1 3 4-7 8 2 10 → 9 1 3 4-8 2 10 → 9 1 3-8 2 10 → 9 1 2-8 10 → 9 1-8 10 → 1-9 10 → 1-10.

Putting it all together. We have identified the two properties that an


arrangement q of
1, 2, . . . , k
must satisfy, so that a P -arrangement of
1, 2, . . . , k, k + 1
may be successfully constructed from q. Now, we are ready to state the
following theorem.

THEOREM 11.1
(P -arrangement theorem) Let q be a P -arrangement of size k + 1. Let
q ′ be obtained by replacing an extreme element (either k + 1 or 1) from its
position j in q, with the empty symbol. Then only one of the following holds.

1. q ′ has no nontrivial intervals, i.e. q ′ is a P -arrangement, or,

2. every nontrivial interval [i1 . . . i2 ] is such that

i1 < j < i2 .

For convenience, we introduce a new terminology (telescopic interval size)


that we will use later. Recall that the r nested intervals are given as

[i11 ..i12 ] ⊂ [i21 ..i22 ] ⊂ .. ⊂ [ir1 ..ir2 ].

Let the size, ij (1 ≤ j ≤ r), of each interval be denoted by

ij = ij2 − ij1 + 1.

Then
i1 < i2 < . . . < ir−1 < ir .
Then the telescoping sizes, xj (1 ≤ j ≤ r), are defined as:

x1 = i1 ,
xj = ij − ij−1 + 1,

and are written as:


x1 ← x2 ← . . . ← xr−1 ← xr .

11.3.3 An upper bound on P -arrangements∗∗


We identify the following functions that can be used to estimate an upper
bound on the number of P -arrangements.

1. P a(k): Let P a(k) denote the number of P -arrangements of size k.

2. S(u, l): Let S(u, l), u ≥ l, denote the number of arrangements of size u
that has only nested intervals and the size of the smallest interval is l.

3. N st(k), Nst’(k):

(a) Let N st(k) denote the number of arrangements of size k that have
only nested intervals. Then N st(k) can be defined in terms of S(·, ·)
as follows:
Nst(k) = Σ_{l=2}^{k} S(k, l).   (11.9)

The summation is used to account for all possible values of l, the


size of the smallest interval.

(b) However, we also wish to count the number of potential places


in the arrangements where element k + 1 can be inserted. The
positions are all in the smallest interval of size l. We denote by
N st′ (k) the number of positions that the element k + 1 can be
inserted in.
Nst′(k) = Σ_{l=2}^{k} S(k, l)(l − 1).   (11.10)

Consider N st′ (k) of Equation (11.10). Note that the smallest interval
of size l may contain the element k, thus placing the element k + 1 in
this interval gives a nontrivial interval resulting in an over estimation of
the number of P -arrangements. Thus using Theorem (11.1) we get the
following:

Pa(k) ≤ (k − 4) Pa(k − 1)   (using Theorem (11.1)(1))
       + Nst′(k − 1).       (using Theorem (11.1)(2))

Estimating S(u, l). Now we are ready to define S(u, l), u ≥ l > 1, in terms
of P a(u′ ) where u′ < u.
First, consider the case when there is exactly one nontrivial interval. The
single nontrivial interval is of size l. Then we can consider a P -arrangement
of size u − l + 1 and each position in this arrangement can be replaced by yet
another P -arrangement of the remaining l elements giving the following:

S(l, l) = P a(l)

Next consider the case where


u > l.
See Exercise 127 to study the simple scenario where the number of nested
intervals (r) and the size of each interval is known. However, in the general
scenario this number (r) and the size of each interval is not known.

Simple scenario. We then study yet another simple scenario where we


compute all arrangements of size u

1. that have only nested intervals,

2. the smallest interval is of size (l = i1 ) > 2,

3. each successive interval differs from the next in size by at least two and

4. the largest nontrivial interval is of size ik1 or ik2 where

l = i1 < ik1 < ik2 < ir = u.



Note that the telescoping sizes (see Section 11.3.2) are as follows:

x1 = i1 .
xk1 = ir − ik1 + 1,
xk2 = ir − ik2 + 1.

Let the number of such arrangements be N and our task is compute N . Note
that we know neither the exact number of nested intervals nor the size of each
interval. But this does not matter.
P -arrangement of size > 2. Note that a P -arrangement of size l > 2 is such
that the largest element is never at the ends of the arrangement. Thus this
arrangement (or its inversion) can be inserted within another P -arrangement
without producing intervals that are nested.
First, we compute N1, the number of arrangements in which the size of the largest nontrivial interval is ik1. Using the principle used in Exercise 127 we obtain

N1 ≤ xk1 P a(xk1 )S(ik1 , l).

Similarly, we get
N2 ≤ xk2 P a(xk2 )S(ik2 , l).
Thus the required number is

N = N1 + N2
  ≤ xk1 Pa(xk1) S(ik1, l) + xk2 Pa(xk2) S(ik2, l)
  = Σ_{j=k1,k2} xj Pa(xj) S(ij, l)
  = Σ_{∆=u−ik1, u−ik2} (∆ + 1) Pa(∆ + 1) S(u − ∆, l).

Back to computation. This sets the stage for computing the number of
arrangements where the size of the largest nontrivial interval takes all possible
values. In other words,
∆ = 1, 2, . . . , (u − l).
Thus, in the general case, we get
S(u, l) ≤ Σ_{∆=1}^{u−l} 2(∆ + 1) Pa(∆ + 1) S(u − ∆, l).   (11.11)

Thus, to summarize,
Pa(k) ≤ (k − 4) Pa(k − 1) + Σ_{l=2}^{k−2} S(k − 1, l)(l − 1).   (11.12)

11.3.3.1 A dynamic programming solution


Note that P a(·) is defined in terms of N st(·) which is defined in terms of
S(·, ·) which is again defined in terms of P a(·). So is this a circular definition
or is it possible to successfully compute P a(·)?
Note that in Equation (11.11), ∆ < u holds and S(u, l) is defined in terms
of P a(k) where k < u and thus can be computed using dynamic programming.
When the optimal solution to a problem can be obtained from optimal so-
lutions of its subproblems, the problem can be usually solved efficiently by
maintaining a table that successively computes the solutions to the subprob-
lems. Thus this table avoids unnecessary re-computations and this approach
is called dynamic programming.3
Here if P a(k) can be obtained from P a(k′ ) where k′ < k, then it is possible
to compute P a(k) in increasing value of k.
We recall the recursive formulation of the problem of computing P a(k) the
number of P -arrangements of 1, 2, . . . , k.
1. For k > 1,

Pa(2) = 2,   Pa(3) = 0,   Pa(4) = 2,
Pa(k) ≤ (k − 4) Pa(k − 1) + Σ_{l=2}^{k−2} S(k − 1, l)(l − 1).

2. For k ≥ l > 1,

S(l, l) = Pa(l),
S(u, l) ≤ Σ_{∆=1}^{u−l} (∆ + 1) Pa(∆ + 1) S(u − ∆, l).

3. For k > 1,

Nst′(k) ≤ Σ_{l=2}^{k} S(k, l)(l − 1).
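A sketch of the resulting dynamic program is given below (Python, not from the text); it evaluates the recurrences above with memoization, and since the recurrences are stated as inequalities, the computed values are upper bounds on Pa(k) rather than exact counts.

```python
# A sketch (not from the text) of the dynamic program suggested by the recursive
# formulation above; the values produced are upper bounds on Pa(k).

def pa_upper_bound(kmax):
    Pa = {2: 2, 3: 0, 4: 2}
    S = {}                                   # memo table for S(u, l)

    def s(u, l):
        if (u, l) not in S:
            if u == l:
                S[(u, l)] = Pa[l]
            else:
                S[(u, l)] = sum((d + 1) * Pa[d + 1] * s(u - d, l)
                                for d in range(1, u - l + 1))
        return S[(u, l)]

    for k in range(5, kmax + 1):
        # item 1 of the recursive formulation: sum over l = 2 .. k-2
        Pa[k] = (k - 4) * Pa[k - 1] + sum(s(k - 1, l) * (l - 1)
                                          for l in range(2, k - 1))
    return Pa

if __name__ == "__main__":
    print(pa_upper_bound(10))
```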

Figure 11.1 shows the order in which the functions can be evaluated. For
convenience, it has been broken down into four phases as shown. To avoid
clutter, the functions P a(·), S(·, ·) and N st′ (·) also refer to the one, two and
one dimensional arrays respectively that store the values of the functions as
shown in the figure.

3 The ‘programming’ refers to the particular order in which the tables are filled up and does
not refer to ‘computer programming’.

[FIGURE 11.1: The three arrays that store the values of Pa(·), S(·, ·) and Nst′(·). The order in which the different functions are evaluated and stored in the arrays in the dynamic programming approach is shown in the first four phases I, II, III and IV; a check-mark entry indicates that the function has been evaluated at that point. See text for details.]

1. Phase I: P a(2) is evaluated as a base case for function P a(·).


(a) Note that S(2, 2) = P a(2).
(b) Note that N st′ (2) = S(2, 2).
(c) Finally P a(3) is evaluated, which happens to be a base case for
P a(·).
2. Phase II:
(a) Note that S(3, 3) = P a(3). Then S(3, 2) is evaluated, which de-
pends on S(2, 2).
(b) Next N st′ (3) is evaluated, which depends on row u = 3 of the
two-dimensional array that stores S(·, ·).
(c) Finally, P a(4) is evaluated, which depends on P a(3) and N st′ (3).
3. Phase III:
(a) Note that S(4, 4) = P a(4). Then S(4, 3) is evaluated, which de-
pends on column l = 3 of array S(·, ·) array. Then S(4, 2) is eval-
uated, which depends on column l = 2 of array S(·, ·).
(b) Next N st′ (4) is evaluated, which depends on row u = 4 of array
S(·, ·).
(c) Finally P a(5) is evaluated, which depends on P a(4) and N st′ (4).
4. Phase K (K > 1): We can now generalize the evaluations in phase K.
(a) Note that S(K + 1, K + 1) = P a(K + 1). Then S(K + 1, j), as j
takes values from K down to 2, is evaluated. At each value of j,
S(K + 1, j) depends on column j entries evaluated up to this point
in the array S(·, ·).
(b) Next N st′ (K + 1) is evaluated whose values depend on the entries
in row u = K + 1 of the array S(·, ·).
(c) Finally, P a(K + 2) is evaluated whose value depends on P a(K + 1)
and N st′ (K + 1).

11.3.4 A lower bound on P -arrangements


We can obtain an easy lower bound on P a(k) (which we will see is already
quite high), or an underestimate of the number of P -arrangements by only
using Principle 1 of Theorem (11.1). We achieve this using the following
recurrence equations (see Section 2.8):
P a(2) = 2,
P a(3) = 0,
P a(4) = 2,
P a(k) ≥ (k − 4)P a(k − 1), for k > 4.

Solving this recurrence form (see Section 2.8) gives


P a(k) ≥ 2(k − 4)!, for k ≥ 4. (11.13)
The over- and underestimates on the number of P -arrangements are given by
Equations (11.12) and (11.13) respectively.4

11.3.5 Estimating the number of frontiers


Let T be a PQ tree (see Section 10.3.1) with k leaf nodes labeled by k
integers
Σ = {1, 2, . . . , k}.
Recall that this tree T
1. has k leaf nodes, and,
2. has N internal nodes (some P nodes and some Q nodes), with
N < k.

Recall from Section 10.6.2 that a PQ tree T encodes I(T ), a collection of


subsets of Σ as shown in Equation (10.5). Let an arrangement q of elements
of Σ be a frontier of T if for every set
I ∈ I(T ),
there exists
1 ≤ i1 < i2 ≤ k
such that
Π(q[i1 ..i2 ]) = I.
Let
F r(T ) = {q | q is a frontier of T }. (11.14)

Two PQ trees T and T ′ are equivalent, denoted
T ≡ T ′,
if one can be obtained from the other by applying a sequence of the following
transformation rules:
1. Arbitrarily permute the children of a P -node, and
2. Reverse the children of a Q-node.
There is yet another view to frontiers of a PQ tree. The frontier of a tree T ,
denoted by F (T ), is the arrangement obtained by reading the labels of the
leaves from left to right. An alternative definition for F r(T ) is as follows:
F r(T ) = {F (T ′ ) | T ′ ≡ T }. (11.15)

4 However, an easy upper bound is P a(k) < k!.



What is the size of F r(T )? The burning question of this section is: Given
a PQ tree T with k leaf nodes, what is the size of F r(T )?
In other words, what is the number of arrangements that encode exactly
the same subsets of Σ as T ?
We define #(A), for each node A of T , as follows. Let node A in the PQ tree T have c children A1, A2, . . . , Ac. Then

#(A) = 1, if A is a leaf node;
#(A) = 2 · ∏_{j=1}^{c} #(Aj), if A is a Q node;                  (11.16)
#(A) = Pa(c) · ∏_{j=1}^{c} #(Aj), if A is a P node.

Recall that Pa(c) is the number of P-arrangements of 1, 2, . . . , c. We next


claim the following.
#(Root(T )) = |F r(T )|,
where Root(T ) is the root node of the PQ tree T . This is best explained
through a concrete example shown in Figure 11.3.
Note that the number of leaf nodes is 7 in this example. We begin by
first relabeling the nodes in the left to right from 1 to 7 as shown. The
arrangements
1234567
and its inversion
7 6 5 4 3 2 1,
are clearly frontiers. The others are computed as follows. For each P node
of T with c children, P a(c) gives the number of possible arrangements of the
children. For each Q node, the number of possible arrangements is only two
as illustrated in the figure.
See Example (133) for another example and #(A) for each node A is com-
puted as shown in Figure 11.2 for this T .
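A small sketch of the bottom-up evaluation of Equation (11.16) is given below (Python, not from the text); the tree encoding as nested tuples and the hard-coded table of small Pa values are illustrative assumptions.

```python
# A sketch (not from the text) of Equation (11.16): #(A) computed bottom-up over
# a PQ tree given as nested tuples ('P', children...) / ('Q', children...) with
# leaves as plain labels.  Pa values only for small node degrees are hard-coded.

PA = {2: 2, 3: 0, 4: 2}          # number of P-arrangements of size c (small c only)

def count(node):
    if not isinstance(node, tuple):          # a leaf node
        return 1
    kind, *children = node
    prod = 1
    for child in children:
        prod *= count(child)
    if kind == "Q":
        return 2 * prod
    return PA[len(children)] * prod          # P node

if __name__ == "__main__":
    # a Q root whose children are a P node over four leaves and a Q node over three
    T = ("Q", ("P", 1, 2, 3, 4), ("Q", 5, 6, 7))
    print(count(T))              # 2 * Pa(4) * 2 = 8 by Equation (11.16)
```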

A practical solution. In the last section we derived lower and upper


bounds for P a(c). Let the under and over bound of P a(c) be given as

L(c) ≤ P a(c) ≤ U (c).

Then Equation (11.16) can be re-written as:

#(A) = 1, if A is a leaf node;
#(A) = 2 · ∏_{j=1}^{c} #(Aj), if A is a Q node;                  (11.17)
L(c) · ∏_{j=1}^{c} #(Aj) ≤ #(A) < U(c) · ∏_{j=1}^{c} #(Aj), if A is a P node.

See Figure 11.4 for an example.



[Figure: the PQ tree with internal nodes A, B, C, D; the annotations read #(A) = 2(1·1·1·1) = 2, #(B) = 2(1·1) = 2, #(C) = 2(2·2·1·1) = 8 and #(D) = 2(8·1) = 16.]
FIGURE 11.2: Computation of #(X) for each node X in the PQ tree using Equation (11.16). Note that the internal nodes are labeled A, . . . , D and #(A) = #(B) = 2, #(C) = 8, and #(D) = 16.

[Figure panels:
(1) The input PQ tree T on the leaves a b c d e f g, with internal nodes A and B and root C.
(2) The leaf nodes numbered 1–7 (left to right) and the internal nodes labeled.
(3) The possible arrangements at each node — Node A: 3142, 2413; Node B: 567, 765; Node C: AB, BA.
(4) The 10 possible arrangements: 1234567; 3142567, 7652413; 3142765, 5672413; 2413567, 7653142; 2413765, 5673142; 7654321.
(5) The same arrangements in the input alphabet: abcdefg; cadbefg, gfebdac; cadbgfe, efgbdac; bdacefg, gfecadb; bdacgfe, efgcadb; gfedcba.]
FIGURE 11.3: Different steps involved in computing |F r(T )| are shown in (1), (2) and (3). The different arrangements are shown in (4) and (5) above.
[Figure: a root node C with children A (a P node with 20 leaf children) and B (a Q node with 11 leaf children); the annotations read 2(16)! < z < 20! at node A, z = 2 at node B, and 8(16)! < z < 4(20)! at node C.]
FIGURE 11.4: The P node A has 20 children and the Q node B has 11 children. #(X) for node X is given by z. The lower and upper estimates have been used for each node.

11.3.6 Combinatorics to probabilities


We connect the final dot by computing the probabilities from the functions
that we have evaluated so far in the preceding sections.
To understand this we make a comparison to DNA patterns. Consider the
following question: What is the probability, pr(x), of seeing a pattern x which
has n1 purines and n − n1 pyrimidines amongst all n length patterns?
For this we compute Nx which is the number of n length patterns with n1 purines and n − n1 pyrimidines. The total number of patterns of length n is 2^n. Then, the probability is given as

pr(x) = Nx / 2^n.
First we define compatibility as follows. Let q be a permutation of some
finite set Σ and let T be a PQ tree with its leaf nodes labeled by elements
of Σ. Further, let q be of size |Σ| and let T have |Σ| leaf nodes. Then T is
compatible with q and vice-versa, if and only if the following holds.

q ∈ F r(T ).

We are now ready to pose our original question along the lines of the earlier
one: What is the probability, pr(T ), of a PQ tree T which has n leaf nodes,
labeled by integers 1, 2, . . . , n, being compatible with a random permutation of
1, 2, . . . , n?
For this we compute NT which is the number of permutations that are
compatible with T . Note that

NT = |F r(T )|.

The total number of permutations of length n is n!. Then, the probability is given as

pr(T ) = NT / n! = |F r(T )| / n!.   (11.18)

An alternative view. Let T be a tree with n leaf nodes. We label the leaf
nodes by integers
1, 2, . . . , n
in the left to right order.5 Let q be a random permutation of integers

1, 2, . . . , n.

See Section 5.2.3 for a definition of random permutation. Then the probability,
pr(T ), of the occurrence of the event

q ∈ F r(T )

is given by

pr(T ) = |F r(T )| / n!.   (11.19)
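For instance, for the PQ tree of Figure 11.3, which has n = 7 leaf nodes and |F r(T )| = 10, this gives pr(T ) = 10/7! = 10/5040 ≈ 0.002.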

Generalization. We have computed the probability of seeing a structured


permutation pattern T two times (K = 2) in Equations (11.18) and (11.19).
This can be generalized to any K as
(pr(T ))^{K−1}.   (11.20)

11.4 Exercises
For the problems below, see Section 5.2.4 for a definition of random
strings and Section 5.2.3 for a definition of random permutations.

Exercise 121 (Disjoint events) Consider the discussion in Section 11.2.2.


Show that the events are disjoint in each of that cases below. In other words,

1. show that Equation (11.6) holds and

5 In fact, the leaves could be labeled in any order (as long as it is a bijective mapping) and

the arguments still hold.



2. show that Equation (11.7) holds.


Hint: Note that the event is defined as one with i1 number of σ1 ’s and i2
number of σ2 ’s, and i3 number of σ3 ’s, ..., and il number of σl ’s. However, if
the event is defined as one with i1 number of σ1 ’s or i2 number of σ2 ’s, or i3
number of σ3 ’s, ..., or il number of σl ’s, then would the equations hold?

Exercise 122 (Effect of boundary) Let

|Σ| = n.

If two input strings s1 and s2 , with no multiplicities, defined on Σ, of length


n are circular, then show that for 1 ≤ k < n,

pk = pn−k ,

where pk is the probability of seeing a set of size k appear together in s1 and


s2 .

Exercise 123 (P -arrangement) For k > 1, let P a(k) denote the number
of P -arrangements of size k. Then what is S(k), the number of permutations
(arrangements) of size k that has exactly one nontrivial interval of size l < k?
Hint: An interval of size l can be treated as a single number, that can then
be expanded to admit its own P -arrangement. Then does the following hold?

S(k) = P a(l)P a(k − l + 1).

Exercise 124 (P -arrangement structure) Let q be a P -arrangement of


integers
i, i + 1, . . . , j − 1, j.
1. Which is the P -arrangement for which i and/or j occur(s) at the end
position(s)?
2. Show that the elements i and j do not occur at the end positions of q
when j − i > 1.

Exercise 125 Prove statements (1) and (2) of Theorem (11.1).


Hint: (1) Use proof by contradiction. (2) If the nontrivial intervals are not
nested, is it possible that q also has a nontrivial interval?

Exercise 126 (Enumeration) Construct all arrangements of elements 1, 2, 3, 4, 5


such that each has

1. exactly one nontrivial interval and this interval is of size 4.

2. exactly one nontrivial interval and this interval is of size 2.

3. the smallest nontrivial interval is of size 2.

Hint: Invert the telescoping and do an appropriate renumbering. In the tables


below, row marked (1a) shows the position, as underlined, of the arrangement
in row (1) that is expanded (inverse of telescoping) in the next row, along
with appropriate renumbering. Row (2a) has a similar interpretation.

1. Row (1) shows the P-arrangement of 1, 2 and row (2) shows its inversion; row (1a) (respectively (2a)) marks, in each copy, the position that is expanded into a size-4 P-arrangement, with appropriate renumbering.

(1)  2 1
(1a) 2 1,  2 1
     4 2 5 3 1,  3 5 2 4 1,  5 3 1 4 2,  5 2 4 1 3

(inversion)
(2)  1 2
(2a) 1 2,  1 2
     3 1 4 2 5,  2 4 1 3 5,  1 4 2 5 3,  1 3 5 2 4

2. Row (1) shows the P-arrangement of 1..4 and row (2) shows its inversion; each position is expanded into a size-2 interval, with appropriate renumbering.

(1)  3 1 4 2
(1a) 3 1 4 2,  3 1 4 2,  3 1 4 2,  3 1 4 2
     4 3 1 5 2,  3 4 1 5 2,  4 2 1 5 3,  4 1 2 5 3,  3 1 5 4 2,  3 1 4 5 2,  4 1 5 3 2,  4 1 5 2 3

(inversion)
(2)  2 4 1 3
(2a) 2 4 1 3,  2 4 1 3,  2 4 1 3,  2 4 1 3
     3 2 5 1 4,  2 3 5 1 4,  2 5 4 1 3,  2 4 5 1 3,  3 5 2 1 4,  3 5 1 2 4,  2 5 1 4 3,  2 5 1 3 4

3. When l = 4, the only possible nested interval sizes are 4 and 5. When l = 3, the number is zero since Pa(3) = 0. When l = 2, let r be the number of intervals; then for each case the possible (nested) interval sizes are given in the following table. For example, when r = 3, 2 < 3 < 5 is a possible configuration of the interval sizes, and in the inverted telescoping process the P-arrangement sizes are 3, 2 and 2.

l = 2:
r = 2:  interval sizes 2 < 5,          telescopic sizes 2 ← 4
r = 3:  interval sizes 2 < 3 < 5,      telescopic sizes 2 ← 2 ← 3
        interval sizes 2 < 4 < 5,      telescopic sizes 2 ← 3 ← 2
r = 4:  interval sizes 2 < 3 < 4 < 5,  telescopic sizes 2 ← 2 ← 2 ← 2

The first column (2 intervals) is already answered in part 1. The second column gives zero P-arrangements since a P-arrangement of size 3 is involved. For the case r = 4, the successive expansions of the nested intervals are:

2 1 → 2 3 1 → 3 2 4 1, 2 4 3 1 → 3 4 2 5 1,  4 2 3 5 1,  2 4 5 3 1,  2 5 3 4 1,

and, starting from the inversion,

2 1 → 3 1 2 → 4 2 1 3, 4 1 3 2 → 5 3 2 1 4,  5 3 1 2 4,  5 1 3 4 2,  5 1 4 2 3.

Exercise 127 Let S ′ (ir , i1 ) be the number of arrangements of size ir such


that each has r nested intervals of sizes

(l = i1 ) < i2 < . . . < ir−1 < (ir = u),

and for 1 < j < r,


xj > 2,

where

x1 = i1 ,
xj = ij − ij−1 + 1, for r ≥ j > 1.

1. Show the following:


S′(ir, i1) ≤ ∏_{j=1}^{r} xj Pa(xj).

2. If xj > 1, for all j, what is S ′ (ir , i1 )?

Hint: 1. Use the ideas of Exercise 126. 2. Consider the scenario when xj
and xj+1 are both of size 2. See also Exercise 126(3) for an illustration.

Exercise 128 Enumerate all the P -arrangements of elements 1 . . . 5.

Hint: Use Theorem (11.1).

Principle 1: 2 4 1 3 → 2 4 1 5 3;   Principle 2: 3 [1 2] 4 → 3 1 5 2 4;
and their inversions: 3 5 1 4 2 and 4 2 5 1 3.

Exercise 129 (Rearrangements) Let q be an arrangement of k elements


that has r > 1 nested intervals and the size of the smallest one is l.

1. Then the elements 1 and k do not occur together in the smallest interval.

2. Each element i in q is replaced by i + 1 to obtain an arrangement q ′ of


elements 2, 3, . . . , k, k + 1. Show that

[i11 . . . i12 ], [i21 . . . i22 ], . . . , [ir1 . . . ir2 ]

are the r intervals in both q and q ′ .

Prove the two statements.

Hint: (1) Observe that l < k. (2) Use proof by contradiction.



Exercise 130 (Over-counting) Recall the following from Section 11.3.1:

Nst′(k) ≤ Σ_{l=2}^{k−2} S(k − 1, l)(l − 1).

Consider the following line of argument.


Recall that the smallest interval is of size l. Using statement (1) of Exer-
cise 129, both elements 1 and k − 1 do not occur together in the smallest
nested interval. Then only one of the following holds.

1. (Case 1): If element k − 1 does not occur, element k can be inserted in


any of the l − 1 positions in the smallest interval.

2. (Case 2): If element k − 1 does occur, 1 does not occur and using state-
ment (2) of Exercise 129, element 0 can be inserted in any of the l − 1
positions. The arrangement of elements 0, 1, . . . , k − 1 is simply renum-
bered to elements 1, 2, . . . , k − 1, k.

Thus
Nst′(k) = Σ_{l=2}^{k−2} S(k − 1, l)(l − 1).

What is incorrect with this line of argument?

Hint: In Case 2, is it possible that this arrangement has already been ac-
counted for? Are all the nested interval sequences accounted for?

Exercise 131 (Correction factors) In Section 11.3.2, an overestimate of


P a(k) is computed. What are the cases that need to be handled to get an exact
estimate of P a(k)?

Hint: How is the counting done if two successive telescopic sizes of the inter-
vals are 2 each? In the arrangements that N st(k) counts, how many are such
that element k occurs in the smallest interval? How are the intervals that are
not strictly nested taken care of?

Exercise 132 (Frontier equivalence) Show that F r(T ) in Equations (11.14)


and (11.15) describe the same set of arrangements.

Exercise 133 Enumerate the frontiers of the PQ tree T shown below. What
is |F r(T )|?

[Figure: the PQ tree T — the root has children C and 9; node C has children A, B, 7 and 8; node A has leaves 1, 2, 3, 4 and node B has leaves 5, 6. The possible arrangements of each internal node are listed in the hint below.]

Hint: Notice that the leaf nodes are already labeled in the left to right order.
Each internal node and the possible arrangements of the children are shown
below.
node A: 3142, 2413;   node B: 56, 65;   node C: 7A8B, B8A7;   node D: C9, 9C.

Exercise 134 Let T be a tree with n leaf nodes which are labeled by integers
1, 2, . . . , n in the left to right order and let q be a random permutation of
integers 1, 2, . . . , n.

1. If T has exactly one Q node and no P nodes, then what is the probability
of the following event:
q ∈ F r(T )?

2. If T has exactly one P node and no Q nodes, then what is the probability
of the following event:
q ∈ F r(T )?

Exercise 135 ∗∗ (Human-rat data) A common cluster of 380 genes in


human chromosome 11 and rat chromosome 1 is shown below with the leaf
nodes being labeled with the genes.

[Figure: the PQ tree T with labeled nodes A, B, C, D, . . . , E, F, G, H, I, . . . , J, K, . . . , L and M, whose contents are listed below; the full tree structure is not reproduced here.]

The labeled nodes of T are to be interpreted as follows.


A: Q node with 146 genes,
B: Q node with 4 genes,
C: Q node with 64 genes,
D. . .E: six nodes with 1 gene each, four Q nodes with 2 genes each,
one Q node with 82 genes,
F: 1 gene,
G: Q node with 5 genes,
H: Q node with 2 genes,
I. . .J: one Q node with 2 genes, 11 nodes with 1 gene each,
K. . .L: twelve with 1 gene each, three Q nodes of 2 genes each,
one Q node with 3 genes, one Q node with 4 genes,
M: Q node with 24 genes.
Compute pr(T ).
Hint: Use Equation (11.17) to get the under- and overestimates of |F r(T )|.

Exercise 136 Argue that Equation (11.20) is the probability of K > 1 occur-
rences of a structured permutation pattern (PQ tree) T .
Hint: How is the probability space defined for K occurrences?
Chapter 12
Topological Motifs

Through science we prove,


through intuition we discover.
- attributed to J. H. Poincare

12.1 Introduction
Due to some unknown reason,1 nature has organized the blue-print of a
living organism along a line. Thus nucleotides on a strand of DNA or amino
acids on a protein sequence (the primary structure) or genes on a chromo-
some are linearly arranged and the study of strings has been an important
component in the general area of bioinformatics.
But sometimes, there is a deviation from this clean organizational simplicity,
for instance, a cell’s metabolic network, as we understand it. A metabolic
pathway is a series of chemical reactions that occur within a cell, usually
catalyzed by enzymes, resulting in the synthesis of a metabolic product that
is stored in the cell. Sometimes, instead of creating such a product, the
pathway may simply initiate another series of reactions (yet another pathway).
Various such metabolic pathways within a cell have a large number of common
components and thus form the cell’s metabolic network. Figure 12.1 shows
an example.
To study and gain an understanding in a domain such as this, one abstracts
the organization of this information as a graph. A graph captures this kind of
complex interrelationships of the entities. Continuing the theme of this book,
we seek the recurring structures in this data.

12.1.1 Graph notation


We enhance the graph notation of Section 2.2 here. A graph G is defined
by a quadruplet (V, E, AV , AE ) as follows:

1. V is the set of vertices.

1 However, speculations abound and there are almost as many theories as there are scientists.


2. E (⊆ V ×V ) is the set of edges. Each element e ∈ E is usually written as


vi vj (where vi , vj ∈ V ). When the graph is directed, the edge e = vi vj
is to be interpreted as an edge from vi to vj .

3. Let AV be the set of all possible attributes that can be assigned to a


vertex. Then
AV : V → AV .
Thus the attribute of v ∈ V is written as AV (v) or simply att(v).

4. Let AE be the set of all possible attributes that can be assigned to an


edge. Then
AE : E → AE .
Thus the attribute of e ∈ E is written as AE (e) or simply att(e) (or
att(vi vj ) where e = vi vj ).

Usually AV ∩ AE is assumed to be empty. In other words the attributes of a


vertex and that of an edge do not ‘mix’.
The size of G will be denoted by NG and is defined to be

NG = |V | + |E|.

However, for convenience we denote a graph as G(V, E) with vertex set V and
edge set E. The vertex and edge attribute mappings AV , AE are assumed to
be implicit.
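As a concrete rendering of this quadruplet, the sketch below (Python, not from the text; all names are illustrative) stores V, E and the two attribute maps, and computes the size NG.

```python
# A minimal rendering (a sketch, not from the text) of the quadruplet
# G = (V, E, A_V, A_E): vertex set, edge set and the two attribute maps,
# with the size N_G = |V| + |E|.

from dataclasses import dataclass, field

@dataclass
class Graph:
    V: set = field(default_factory=set)
    E: set = field(default_factory=set)            # edges as (v_i, v_j) tuples
    att_V: dict = field(default_factory=dict)      # vertex -> attribute
    att_E: dict = field(default_factory=dict)      # edge   -> attribute

    def size(self):
        return len(self.V) + len(self.E)           # N_G = |V| + |E|

if __name__ == "__main__":
    g = Graph(V={"u", "v", "w"},
              E={("u", "v"), ("v", "w")},
              att_V={"u": "red", "v": "blue", "w": "red"},
              att_E={("u", "v"): "solid", ("v", "w"): "dashed"})
    print(g.size())                                 # 5
```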

12.2 What Are Topological Motifs?


In this section we give an informal, intuitive introduction to topological
motifs.2 Consider the graph in Figure 12.2 with seven connected components
numbered for convenience from (1) to (7). The vertex attributes are repre-
sented by the color of the vertex in the figure. Thus a vertex attribute can be
red, blue, green or black. The edges are directed and the attribute is denoted
by the type of edge (solid or dashed) in the figure.
What are the recurring structures in this graph? For example, a vertex
with an attribute red occurs seven times in the graph. Is the occurrence of
a red vertex always accompanied by another vertex (of a specific attribute)
along with an edge (of a specific attribute)? So we use the natural notion of
maximality when we enumerate these common structures or motifs. A formal
definition of maximal motifs is given in the later sections.

2 In literature they are also called network motifs.



Figure 12.3 gives the exhaustive list of maximal and connected structures
(subgraphs or motifs) that occurs at least two times in the input. Here con-
nected is defined as follows:

For any pair of vertices v and v′, there is a path from v to v′ that can be obtained by ignoring the directions of the edges.

In this example, each occurrence of the motif is in a distinct connected com-


ponent of the graph. There are seventeen such motifs.

12.2.1 Combinatorics in topologies


We next make the problem a little harder by making all the edges undi-
rected and having the same attribute. However, we maintain the following
simplifying property:

No two adjacent vertices have the same attribute and no two vertices
adjacent to one vertex have the same attribute.

Consider the graph shown in Figure 12.4. Again, the color of the vertex de-
notes its associated attribute and the graph has seven connected components
numbered, for convenience, from (1) to (7).
The exhaustive list of maximal common structures (motifs) on this input
graph is 63 and is shown in Figure 12.5.

Estimating the output size. We spell out the underlying combinatorics


in terms of common structures, in this extreme example. It is not difficult
but tedious. But it is a good exercise that shows the enormity of the task
involved, in a worst case scenario.
Assuming the output is the list of all the maximal motifs and their oc-
currences in the graph, we do a simple exercise of estimating the size of the
output.
Figure 12.7 gives the input size and the details involved in calculating the
output size. This shows that the size of the input (which is simply the sum
of the number of vertices and number of edges) is only 144 whereas the size
of the output is 1448. The size of the output also includes the details of the
occurrences of each motif.
The occurrence of a motif is defined by a subset of vertices and edges in the
input graph. Thus each vertex in the motif defines a list of vertices and each
edge defines a list of edges that correspond to the occurrence of this motif.
Figure 12.8 gives the number of these vertex and edge lists in the graph that
correspond to all the maximal motifs.

12.2.2 Input with self-isomorphisms


A graph isomorphism is a bijection f , i.e., a one-to-one and onto mapping,
between the vertices of two graphs G1 (V1 , E1 ) and G2 (V2 , E2 ),

f : V1 → V2 ,

with the property that for v1 , v2 ∈ V , if

(v1 v2 ) ∈ E1

then
(f (v1 )f (v2 )) ∈ E2 .
Two graphs are isomorphic if such a bijection f exists.
A graph G(V, E) is self-isomorphic if there is a bijection f : V → V as above
(taking G1 = G2 = G) and there exists some v ∈ V such that
v ≠ f (v).
We abuse notation here and say that given l > 0, a graph G(V, E) displays
self-isomorphism, if for some nonempty sets

V1 ≠ V2 ⊂ V

and |V1 | > l there exists a bijection

f : V1 → V2 ,

with the property that for v1 , v2 ∈ V1 , if

(v1 v2 ) ∈ E

then
(f (v1 )f (v2 )) ∈ E.
Self-isomorphism is an important property of a graph, since it can be partic-
ularly confounding to methods that attempt to recognize common structures
in the graph.

Concrete example. We next modify the graph under study by changing


the attributes of some of the vertices to obtain the graph shown in Figure 12.9.
The vertices now have only two types of attributes.
As a result, the graph becomes highly self-isomorphic. Figure 12.10 shows the iso-
morphisms in the different connected components of the graph. Components
(1), (4) and (5) are isomorphic to each other and components (2), (3) and (6)
are isomorphic to each other. Thus, for all practical purposes, the graph has
three connected components, numbered (5), (6) and (7).
Figure 12.11 gives an exhaustive list of maximal motifs that occur at least
two times in the graph. Note that a motif can have two occurrences that
overlap. We consider two occurrences to be distinct if, in the two occurrences,
there is at least one edge that is not present in both.

FIGURE 12.1: Synthesis of estrogens in the ovary: Estrogens (estradiol,
estrone) are synthesized via androgen intermediates; the synthesis takes place
in the granulosa and the theca interna cells of the follicle, each of which is
under the control of a different hormone.

In Figure 12.11, we count only distinct occurrences. Thus an occurrence of
5(2), 6, 7(3) is to be interpreted as having two distinct occurrences in compo-
nent (5), one occurrence in component (6) and three distinct occurrences in
component (7).
In this scenario, when is a motif maximal? Usually, a subgraph M ′ of a motif
M is considered nonmaximal with respect to (w.r.t.) M . We use the definition
that takes the occurrences as well into account. Thus the occurrences may
actually determine whether M ′ is maximal or not:
1. If the number of distinct occurrences of M and M ′ differ, then M ′ is
maximal w.r.t. M .
2. However, if the number of distinct occurrences is the same, then M ′ is
nonmaximal w.r.t. M .

12.3 The Topological Motif


In the remainder of the discussion we deal with graphs with undirected edges
and attributes defined only on the vertices. See Exercises 141 and 142 for a

(1) (2) (3) (4)

(5) (6)

(7)

FIGURE 12.2: The input graph with 7 connected components where the
edges may be directed and have different attributes. The two attributes are
shown as solid and dashed edges.

systematic reduction of a directed graph with attributes on both edges and


vertices to an undirected graph with attributes only on the vertices. However
this conversion results in a graph with a larger vertex set. The methods
presented here can also be adapted easily to directed graphs (with edge attributes);
in fact, that setting is simpler than the scenario discussed here.
Informally, as we have seen in the last sections, a topological motif is a
‘graph’ that occurs multiple times in a given graph G (or a collection of
graphs). Usually the interest is in a motif (say, M ) that occurs at least K
times in an input graph. This K is usually referred to as a quorum constraint.
This is so called because if a subgraph occurs less than K times, it may not
be of interest. Formally a topological motif and its occurrence is defined as
follows.

DEFINITION 12.1 (topological motif, occurrence, mappings Fi , location


list) Given a graph
G(V, E),

a topological motif is a connected graph

M (VM , EM ),

|Vm |=4
|Em |=5
N4,5 =0
|Vm |=4
|Em |=4
N4,4 =1
1, 3, 7
|Vm |=4
|Em |=3
N4,3 =2
1, 3, 4, 7 1, 2, 3, 7
|Vm |=3
|Em |=3
N3,3 =1
1, 3, 6, 7
|Vm |=3
|Em |=2
N3,2 =5
1, 3, 5, 6, 7 1, 3, 4, 5, 7 1, 2, 3, 6, 7 2, 3, 4, 5, 7 1, 2, 4, 5, 7
|Vm |=2
|Em |=1
N2,1 =4
1, 2, 3, 4, 5, 7 1, 2, 3, 5, 6, 7 1, 2, 3, 5, 6, 7 1, 2, 3, 4, 6, 7
|Vm |=1
|Em |=0
N1,0 =4
1, 2, 3, 4, 5, 6, 7 1, 2, 3, 4, 5, 6, 7 1, 2, 3, 4, 5, 6, 7 1, 2, 3, 4, 5, 6, 7

FIGURE 12.3: Maximal (connected) motifs that occur at least two times
in the input graph of Figure 12.2. Nx,y denotes the number of motifs with x
vertices and y edges and each row shows motifs with x number of vertices and y
number of edges. For example Row 2 shows the maximal motifs with 5 vertices
and 4 edges each. The list of numbers below the motif gives the occurrence list,
for example 1, 3, 7 indicates that the motif occurs in components numbered
(1), (3) and (7).

(1) (2) (3) (4)

(5) (6)

(7)

FIGURE 12.4: The input graph with 7 connected components numbered


(1) to (7). All the edges are undirected and have the same attribute. The
attributes of the vertices are given by their color.

where
VM = {u1 , u2 , . . . , up } , p ≥ 1
and is said to occur on

Oi = {vi1 , vi2 , . . . , vip } ⊆ V,

if and only if there is a mapping

Fi : VM → O i ,

such that,
1. for each u ∈ VM ,
att(u) = att(Fi (u)), and

2. for each (u1 u2 ) ∈ EM ,

(Fi (u1 )Fi (u2 )) ∈ E.

Let the number of such distinct mappings (Fi ’s) be K ′ . Then the number of
occurrences is K ′ and the occurrence lists are

O 1 , O 2 , . . . , OK ′ .

|Vm |=4
|Em |=5
N4,5 =6
1, 7 2, 7 3, 7 4, 7 5, 7 6, 7
|Vm |=4
|Em |=4
N4,4 =15
1, 3, 7 1, 4, 7 3, 4, 7 4, 5, 7
|Vm |=4
|Em |=3
N4,3 =16
1, 3, 5, 7 1, 3, 4, 7 2, 3, 4, 7 2, 4, 5, 7
|Vm |=3
|Em |=3
N3,3 =4
2, 4, 6, 7 1, 3, 6, 7 1, 4, 5, 7 2, 3, 5, 7
|Vm |=3
|Em |=2
N3,2 =12
1, 3, 4, 5, 7 1, 2, 3, 6, 7 2, 3, 4, 5, 7 1, 2, 4, 5, 7
|Vm |=2
|Em |=1
N2,1 =6
1, 2, 3, 4, 5, 7 1, 2, 3, 5, 6, 7 2, 3, 4, 5, 6, 7 1, 2, 3, 4, 6, 7
|Vm |=1
|Em |=0
N1,0 =4
1, 2, 3, 4, 5, 6, 7 1, 2, 3, 4, 5, 6, 7 1, 2, 3, 4, 5, 6, 7 1, 2, 3, 4, 5, 6, 7

FIGURE 12.5: Maximal (connected) motifs that occur at least two times
in the input graph of Figure 12.4. In this example each occurrence of the
motif is in a distinct component of the graph. Nx,y denotes the number of
motifs with x vertices and y edges. Each row shows motifs with x number
of vertices and y number of edges. For example Row 1 shows some maximal
motifs with 4 vertices and 5 edges each.

(1) (2) (3)

FIGURE 12.6: Continuing the example of Figure 12.4. The two edges in each
structure may not be adjacent, leading to disconnected structures; hence the
three structures are not motifs.

Input
vertices |V |   edges |E|   input size |V | + |E|
28              36          64

Output
vertices edges number occurrences output size
|Vm | |Em | l K′ lK ′ (|Vm | + |Em |)
4 5 6 2 108
4 4 15 3 360
4 3 16 4 448
3 3 4 4 96
3 2 12 5 300
2 1 6 6 108
1 0 4 7 28
63 1448

FIGURE 12.7: The number of distinct maximal (connected) motifs that


occur at least two times in the graph of Figure 12.4 is 63.

Vertex Lists (with nV = l|Vm |)
|Vm |   l    K ′    nV     K ′ nV
4       6    2      24      48
4      15    3      60     180
4      16    4      64     256
3       4    4      12      48
3      12    5      36     180
2       6    6      12      72
1       4    7       4      28
       63           212    812

Edge Lists
|Em |   l    K ′    l|Em |
5       6    2      30
4      15    3      60
3      16    4      48
3       4    4      12
2      12    5      24
1       6    6       6
0       4    7       0
       63           180

FIGURE 12.8: Number of distinct location lists of the maximal motifs in


Figure 12.7.

(1) (2) (3) (4)

(5) (6)

(7)

FIGURE 12.9: The input graph with only two (vertex) attributes.

(1) (4) (5)

(2) (3) (6)

FIGURE 12.10: Consider the graph of Figure 12.9. The top 3 components
are isomorphic to each other and so are the bottom three components.

|Vm |=4
|Em |=5
N4,5 =2
5, 7(3) 6, 7(3)
|Vm |=4
|Em |=4
N4,4 =2
5(2), 6(2), 7(3) 5, 6, 7(3)
|Vm |=4
|Em |=3
N4,3 =3
5(2), 6(2), 7(6) 5(4), 6(2), 7(6) 5(2), 6, 7(3)
|Vm |=3
|Em |=3
N3,3 =2
5, 7 5, 6(2), 7(3)
|Vm |=3
|Em |=2
N3,2 =2
5(3), 6, 7(3) 5(2), 6(3), 7(6)
|Vm |=2
|Em |=1
N2,1 =2
5(2), 6(3), 7(3) 5(3), 6(2), 7(3)
|Vm |=1
|Em |=0
N1,0 =2
5(3), 6(3), 7(3) 5, 6, 7

FIGURE 12.11: Maximal (connected) motifs that occur at least two times
in the input graph of Figure 12.9. Since the components (1), (4) and (5)
are isomorphic (or identical) and so are components (2), (3) and (6), the
occurrences are listed only for components (5), (6) and (7) for each motif.

For a set of vertices U ⊆ VM , let

Fi (U ) = {Fi (v) | v ∈ U (⊆ VM )}.

The location list of U , LU , is defined as

LU = {Fi (U ) | 1 ≤ i ≤ K ′ }.

If U is a singleton set,
U = {uj },
then its location list may also be written as

Luj .

LU , the location list of U is given by the following

LU = {F1 (U ), F2 (U ), . . . , FK ′ (U )}.

The graph induced by the vertices Fi (u), u ∈ Vm , is the ith occurrence


subgraph of motif M (VM ,EM ) on the input graph G(V, E).

Notice that a mapping (called Fi in the definition) is required to unambigu-


ously define an occurrence of a motif. Also when U is not a singleton set it
is usually a collection of vertices that have the same attribute and LU is a
multi-set or a set of sets of vertices.
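
As a small illustration of Definition 12.1, the following Python sketch (the
dictionary-based encoding of the graph and the helper name is_occurrence are
chosen here purely for exposition, and are not part of the original development)
checks whether a candidate mapping Fi satisfies the two conditions: attributes
are preserved, and every motif edge maps to an edge of the input graph.

def is_occurrence(motif, graph, F):
    # Condition 1 of Definition 12.1: attributes are preserved by F.
    for u, a in motif['att'].items():
        if a != graph['att'][F[u]]:
            return False
    # Condition 2: every motif edge maps to an edge of the input graph.
    for (u1, u2) in motif['edges']:
        if F[u2] not in graph['adj'][F[u1]]:
            return False
    return True

# Toy input: a blue-green edge with a red vertex hanging off the green one.
graph = {
    'att': {'v1': 'red', 'v2': 'blue', 'v3': 'green'},
    'adj': {'v1': {'v3'}, 'v2': {'v3'}, 'v3': {'v1', 'v2'}},
}
motif = {'att': {'u1': 'blue', 'u2': 'green'}, 'edges': [('u1', 'u2')]}
print(is_occurrence(motif, graph, {'u1': 'v2', 'u2': 'v3'}))  # True
print(is_occurrence(motif, graph, {'u1': 'v1', 'u2': 'v3'}))  # False: attribute mismatch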

12.3.1 Maximality
Consider the input graph with two connected components shown in Fig-
ure 12.12(a). Let the quorum be 2. A motif

M (VM , EM )

with
VM = {u1 , u2 , u3 }
is shown in Figure 12.12(b). The two occurrences of the motif with
1. att(u1 ) = blue,
2. att(u2 ) = green,
3. att(u3 ) = red,
are given as follows:
1. O1 = {v2 , v3 , v1 }, with
F1 (u1 ) = v2 , F1 (u2 ) = v3 , F1 (u3 ) = v1 and

v1 v4 v9
v6 v10
v5
v11
v2 v3 v7 v8
(a) Input graph with two connected components.

u4 u5
u4
u3 u3
u3
u1 u2 u1 u2 u1 u2

(b) Motif 1. (c) Motif 2. (d) Motif 3.

FIGURE 12.12: (a) The different attributes of the nodes (vertices) are
shown in different colors. v1 to v5 form one connected component and the
other is formed by v6 to v11 . (b) and (c) show two motifs that occur once in
each connected component of the input graph. (d) The maximal version of these
motifs, i.e., no more vertices or edges can be added to this motif.

2. O2 = {v7 , v8 , v11 }, with


F2 (u1 ) = v7 , F2 (u2 ) = v8 and F2 (u3 ) = v11 .

The location list of the vertices of the motif are:

1. Lu1 = {v2 , v7 },

2. Lu2 = {v3 , v8 }, and

3. Lu3 = {v1 , v11 }.

Notice that Motif 1 is a subgraph of Motif 2 and Motif 2 is a subgraph of


Motif 3. Each of them occurs exactly two times in the input graph. Thus all
the information about Motifs 1 and 2 is already contained in Motif 3. This
calls for a notion of maximality of motifs, which we formally define below.

DEFINITION 12.2 (maximal motif, edge-maximal motif, vertex-maximal


motif ) Given G(V, E), let
M (Vm , Em )
be a topological motif with its complete occurrence list

O 1 , O 2 , . . . , Ol ,

with the mappings for 1 ≤ i ≤ l as

Fi : VM → O i .

• Edge-maximal: The motif M (Vm , Em ) is edge-maximal when for all


pairs u1 , u2 ∈ Vm ,

if (Fi (u1 )Fi (u2 )) ∈ E for all i, then (u1 u2 ) ∈ Em holds.

• Vertex-maximal: The motif M (Vm , Em ) is vertex-maximal when there


do not exist vertices

v1 , v1′ , v2 , v2′ , . . . , vl , vl′ ∈ V

such that for some u ∈ Vm ,

Fi (u) = vi′ , for each i,

and, the following hold for all i:

1. vi ∉ Oi , vi′ ∈ Oi ,
2. att(vi ) = a, for some attribute a, and,
3. (vi vi′ ) ∈ E.

The motif is maximal if both edge-maximality and vertex-maximality hold.

In other words, edge-maximality ensures that no more edges can be added


to the motif and vertex-maximality ensures that no more vertices can be added
to the motif without altering the occurrence list. Continuing the example of
Figure 12.12, motifs 1 and 2 are nonmaximal. Since no more edges or vertices
can be added to motif 3, it is maximal.
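
For concreteness, edge-maximality can be tested mechanically from Definition 12.2:
a pair of motif vertices whose images form an edge of G in every occurrence is a
witness against edge-maximality. The sketch below is only an illustration (the data
layout and helper name are assumptions made here, not the author's code).

from itertools import combinations

def is_edge_maximal(motif_vertices, motif_edges, mappings, adj):
    motif_edges = {frozenset(e) for e in motif_edges}
    for u1, u2 in combinations(motif_vertices, 2):
        if frozenset((u1, u2)) in motif_edges:
            continue
        # If (F_i(u1), F_i(u2)) is an edge of G in *every* occurrence,
        # the motif is missing an edge, hence it is not edge-maximal.
        if all(F[u2] in adj[F[u1]] for F in mappings):
            return False
    return True

# Toy check: two triangles, but a motif that is only a path of length two.
adj = {
    'a1': {'b1', 'c1'}, 'b1': {'a1', 'c1'}, 'c1': {'a1', 'b1'},
    'a2': {'b2', 'c2'}, 'b2': {'a2', 'c2'}, 'c2': {'a2', 'b2'},
}
F1 = {'u1': 'a1', 'u2': 'b1', 'u3': 'c1'}
F2 = {'u1': 'a2', 'u2': 'b2', 'u3': 'c2'}
print(is_edge_maximal(['u1', 'u2', 'u3'],
                      [('u1', 'u2'), ('u2', 'u3')], [F1, F2], adj))  # False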

12.4 Compact Topological Motifs


We first discuss an important issue with counting the number of occur-
rences of a motif: we call this the combinatorial explosion due to occurrence-
isomorphisms.

12.4.1 Occurrence-isomorphisms
We next consider a slightly modified input graph shown in Figure 12.13(a).
Here the vertex attributes black and white have both been replaced by red.
How does the problem scenario change?

Motif 1 of Figure 12.13(b). This motif is given by M (Vm , Em ) where


Vm = {u1 , u2 , u3 , u4 , u5 } and

v1 v4 v9
v6 v10
v5
v11
v2 v3 v7 v8
(a) Input graph.

u1 u4 u5 u1 u4
u5 u4 u5
u3
u6
u2 u3 u1 u2 u2 u3

(b) Motif 1. (c) Motif 2. (d) A structure.

FIGURE 12.13: (a) The input graph with two connected components. (b)
and (c) show motifs that occur at least twice on the graph. (d) A structure
that is not a motif in (a).

att(u1 ) = att(u4 ) = att(u5 ) = red,

att(u2 ) = blue and

att(u3 ) = green.

The eight occurrences of this motif are given as follows:

• First connected component of the input graph:

1. O1 = {v1 , v2 , v3 , v4 , v5 }, with
F1 (u1 ) = v1 , F1 (u2 ) = v2 , F1 (u3 ) = v3 , F1 (u4 ) = v4 , F1 (u5 ) = v5 .
2. O2 = {v1 , v2 , v3 , v4 , v5 }, with
F2 (u1 ) = v1 , F2 (u2 ) = v2 , F2 (u3 ) = v3 , F2 (u4 ) = v5 , F2 (u5 ) = v4 .

• Second connected component of the input graph:

3. O3 = {v6 , v7 , v8 , v9 , v10 }, with


F3 (u1 ) = v6 , F3 (u2 ) = v7 , F3 (u3 ) = v8 , F3 (u4 ) = v9 , F3 (u5 ) = v10 .
4. O4 = {v6 , v7 , v8 , v10 , v9 }, with
F4 (u1 ) = v6 , F4 (u2 ) = v7 , F4 (u3 ) = v8 , F4 (u4 ) = v10 , F4 (u5 ) = v9 .
5. O5 = {v6 , v7 , v8 , v10 , v11 }, with
F5 (u1 ) = v6 , F5 (u2 ) = v7 , F5 (u3 ) = v8 , F5 (u4 ) = v10 , F5 (u5 ) =
v11 .
6. O6 = {v6 , v7 , v8 , v11 , v10 }, with
F6 (u1 ) = v6 , F6 (u2 ) = v7 , F6 (u3 ) = v8 , F6 (u4 ) = v11 , F6 (u5 ) =
v10 .

7. O7 = {v6 , v7 , v8 , v9 , v11 }, with


F7 (u1 ) = v6 , F7 (u2 ) = v7 , F7 (u3 ) = v8 , F7 (u4 ) = v9 , F7 (u5 ) = v11 .
8. O8 = {v6 , v7 , v8 , v11 , v9 }, with
F8 (u1 ) = v6 , F8 (u2 ) = v7 , F8 (u3 ) = v8 , F8 (u4 ) = v11 , F8 (u5 ) = v9 .

When attributes of two or more vertices of the motif are identical, sometimes
they can be mapped to a fixed set of vertices of the input graph in combi-
natorially all possible ways. For example u4 and u5 of the motif are mapped
onto the pair v4 and v5 in two possible ways (given by F1 and F2 ). Similarly,
u4 and u5 are mapped to any two of v9 , v10 and v11 in six possible ways.
We term this explosion in the number of distinct mappings as combinatorial
explosion due to occurrence-isomorphism.

Motif 2 of Figure 12.13(c). This motif is given by M (Vm , Em ) where


Vm = {u1 , u2 , u3 , u4 , u5 } and

att(u1 ) = blue,

att(u2 ) = green, and

att(u3 ) = att(u4 ) = att(u5 ) = red.

The twelve occurrences of this motif are given as follows:

• First connected component of the input graph:

1. O1 = {v2 , v3 , v4 , v5 , v1 }, with
F1 (u1 ) = v2 , F1 (u2 ) = v3 , F1 (u3 ) = v4 , F1 (u4 ) = v5 , F1 (u5 ) = v1 .
2. O2 = {v2 , v3 , v4 , v5 , v1 }, with
F2 (u1 ) = v2 , F2 (u2 ) = v3 , F2 (u3 ) = v4 , F2 (u4 ) = v1 , F2 (u5 ) = v5 .
3. O3 = {v2 , v3 , v4 , v5 , v1 }, with
F3 (u1 ) = v2 , F3 (u2 ) = v3 , F3 (u3 ) = v1 , F3 (u4 ) = v4 , F3 (u5 ) = v5 .
4. O4 = {v2 , v3 , v4 , v5 , v1 }, with
F4 (u1 ) = v2 , F4 (u2 ) = v3 , F4 (u3 ) = v1 , F4 (u4 ) = v5 , F4 (u5 ) = v4 .
5. O5 = {v2 , v3 , v4 , v5 , v1 }, with
F5 (u1 ) = v2 , F5 (u2 ) = v3 , F5 (u3 ) = v5 , F5 (u4 ) = v1 , F5 (u5 ) = v4 .
6. O6 = {v2 , v3 , v4 , v5 , v1 }, with
F6 (u1 ) = v2 , F6 (u2 ) = v3 , F6 (u3 ) = v5 , F6 (u4 ) = v4 , F6 (u5 ) = v1 .

• Second connected component of the input graph:

7. O7 = {v7 , v8 , v9 , v10 , v11 }, with


F7 (u1 ) = v7 , F7 (u2 ) = v8 , F7 (u3 ) = v9 , F7 (u4 ) = v10 , F7 (u5 ) =
v11 .

8. O8 = {v7 , v8 , v9 , v10 , v11 }, with


F8 (u1 ) = v7 , F8 (u2 ) = v8 , F8 (u3 ) = v9 , F8 (u4 ) = v11 , F8 (u5 ) =
v10 .
9. O9 = {v7 , v8 , v9 , v10 , v11 }, with
F9 (u1 ) = v7 , F9 (u2 ) = v8 , F9 (u3 ) = v10 , F9 (u4 ) = v11 , F9 (u5 ) =
v9 .
10. O10 = {v7 , v8 , v9 , v10 , v11 }, with
F10 (u1 ) = v7 , F10 (u2 ) = v8 , F10 (u3 ) = v10 , F10 (u4 ) = v9 , F10 (u5 ) =
v11 .
11. O11 = {v7 , v8 , v9 , v10 , v11 }, with
F11 (u1 ) = v7 , F11 (u2 ) = v8 , F11 (u3 ) = v11 , F11 (u4 ) = v9 , F11 (u5 ) =
v10 .
12. O12 = {v7 , v8 , v9 , v10 , v11 }, with
F12 (u1 ) = v7 , F12 (u2 ) = v8 , F12 (u3 ) = v11 , F12 (u4 ) = v10 , F12 (u5 ) =
v9 .

In this example, the combinatorial explosion due to occurrence-isomorphism


is from the motif vertices u3 , u4 , u5 being mapped in all possible ways to
v4 , v5 , v1 in the first connected component and to v9 , v10 , v11 in the second
connected component of the input graph.

12.4.2 Vertex indistinguishability


Is it possible to count and describe the distinct occurrences without the
combinatorial explosion as seen in the last two examples? For this we need
to first recognize indistinguishable vertices, which are defined below. Given a
graph 3
M (VM , EM )
the vertices in
U1 ⊆ VM
are indistinguishable w.r.t. U2 ⊆ VM if and only if

(1) att(ui ) = a1 , and att(uj ) = a2 , for all ui ∈ U1 , uj ∈ U2 , for some


attributes a1 and a2 , and,

(2) there is an edge (ui uj ) ∈ EM , for each ui ∈ U1 and uj ∈ U2 .

Vertices in U1 are said to be indistinguishable from each other w.r.t. U2 .


Further, if there exists no
U1′ ⊋ U1

3 Although usually a graph is written as G(V, E), here we use a motif notation M (VM , EM )
for the graph to emphasize the fact that indistinguishability of vertices is primarily associ-
ated with motifs (but evidenced in the input graph).

such that U1′ is indistinguishable w.r.t. U2 , then the set U1 is maximally


indistinguishable w.r.t. U2 .
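
The following small sketch tests the two conditions of the definition: a common
attribute within each set, and an edge between every pair of vertices across the
two sets. The edge set used below for the motif of Figure 12.13(b) is our reading
of the drawing and is assumed here only for illustration.

def indistinguishable(U1, U2, att, edges):
    E = {frozenset(e) for e in edges}
    # (1) all vertices of U1 share one attribute; likewise for U2.
    if len({att[u] for u in U1}) != 1 or len({att[u] for u in U2}) != 1:
        return False
    # (2) every vertex of U1 is joined to every vertex of U2.
    return all(frozenset((u1, u2)) in E for u1 in U1 for u2 in U2)

# Motif 1 of Figure 12.13(b), with edges assumed as u1-u2, u2-u3, u3-u4, u3-u5.
att = {'u1': 'red', 'u2': 'blue', 'u3': 'green', 'u4': 'red', 'u5': 'red'}
edges = [('u1', 'u2'), ('u2', 'u3'), ('u3', 'u4'), ('u3', 'u5')]
print(indistinguishable({'u4', 'u5'}, {'u3'}, att, edges))  # True
print(indistinguishable({'u1', 'u4'}, {'u3'}, att, edges))  # False: u1 is not adjacent to u3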

12.4.3 Compact list


To ease handling of the combinatorial explosion due to occurrence isomor-
phisms, we introduce the compact list notation for the location lists of topo-
logical motifs [Par07a]. It is often possible to represent LU in a much more
compact way taking into account the fact that the vertices in U are indistin-
guishable.
For instance, in the motif in Figure 12.13(b),

U = {u4 , u5 }

is a maximal set of indistinguishable vertices (w.r.t {u3 }) and

LU = {{v4 , v5 }, {v9 , v10 }, {v11 , v10 }, {v9 , v11 }}.

However, a more compact way to represent LU is by the set,

LcU = {{v4 , v5 }, {v9 , v10 , v11 }}.

and one recovers LU from LcU by taking all two-element subsets (two being the
smallest cardinality of the sets in LU ) of the sets in LcU . We call

1. LU the expansion of the set LcU , and

2. LcU a compact form of LU .

In the rest of the discussion, we denote a compact list LcU simply by LU


and it should be clear from the context what we mean.
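
A compact list is recovered into its expansion mechanically; the sketch below
(illustrative only, with names chosen here for exposition) takes all d-element
subsets of each member, where d is the smallest member cardinality.

from itertools import combinations

def expand(compact):
    d = min(len(L) for L in compact)
    out = set()
    for L in compact:
        for sub in combinations(sorted(L), d):
            out.add(frozenset(sub))
    return out

Lc = [{'v4', 'v5'}, {'v9', 'v10', 'v11'}]
for s in sorted(expand(Lc), key=sorted):
    print(sorted(s))       # prints the four two-element sets of L_U above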

12.4.4 Compact vertex, edge & motif


Next, we define a compact vertex and a compact edge which is a natural
next step after recognizing indistinguishable vertices and compact location
lists. Given a maximal motif

M (VM , EM ),

if

1. U1 (⊂ VM ) is maximally indistinguishable w.r.t. U2 (⊂ VM ) and

2. U2 is maximally indistinguishable w.r.t. U1 ,

then U1 and U2 are called compact vertices. Further, each of the following
holds.

1. Each vertex in Ui has the same attribute, for i = 1, 2.

2. U1 ∩ U2 = ∅.

3. Since there is an edge from each vertex u1 ∈ U1 to each vertex u2 ∈ U2 ,


there are
|U1 | × |U2 |
edges between U1 and U2 . We represent the |U1 | × |U2 | edges by a single
edge, written as
(U1 U2 ).
This is also called a compact edge. In other words, there is a compact
edge between two compact vertices.

This naturally leads to the compact notation for the motif where the vertex
set is a collection of compact vertices and compact edges defined on them. For
convenience, we represent a compact motif as C with the following notation

C(VC , EC ) ≡ M (VM , EM ),

where VC is the set of compact vertices and EC the set of compact edges.
It is important to note that two compact vertices may have a nonempty
intersection. The second example shown in Figure 12.14 illustrates such a
nonempty intersection.

12.4.5 Maximal compact lists


We have defined compact lists, but what is a maximal compact list? As
we have seen earlier a compact list is inalienably associated with a compact
vertex. The maximality of a compact list is ‘inherited’ from the associated
compact vertex.
Given a graph G(V, E), we say a compact list L is a maximal compact list,
if there exists some maximal compact motif C(VC , EC ) such that L = LU
with U ∈ VC .
The maximality property of the compact list is central to the discovery
method of [Par07a]. Great care is taken to ensure that a new compact list
that is generated is maximal. With this in mind, we discuss some operations
that maintain this maximality of the list.

12.4.6 Conjugates of compact lists


In the following discussion let D be the number of distinct attributes in the
given graph G(V, E). Also for an attribute x, let

Vx = {v ∈ V | att(v) = x}.

u5
u4
u3
u1 u2

M (VM , EM ) C(VC , EC )
VM = {u1 , u2 , u3 , u4 , u5 } VC = {U1 , U2 , U3 }
EM = {u1 u2 , u2 u3 , u2 u4 , u2 u5 } EC = {U1 U2 , U2 U3 }

In the compact notation


U1 = {u1 },
U2 = {u2 } and
U3 = {u3 , u4 , u5 }.
(a)

u1 u4
u5
u6
u2 u3

M (VM , EM ) C(VC , EC )
VM = {u1 , u2 , u3 , u4 , u5 , u6 } VC = {U1 , U2 , U3 , U4 }
EM = {u1 u2 , u2 u3 , u3 u4 , u3 u5 , u3 u6 } EC = {U1 U2 , U2 U3 , U3 U4 }

In the compact notation


U1 = {u1 },
U2 = {u2 },
U3 = {u3 } and
U4 = {u1 , u4 , u5 , u6 }.
(b)

FIGURE 12.14: Two examples of motifs with their usual notation and the
corresponding compact notation. Notice that in the motif in (b), two compact
vertices U1 and U4 have a nonempty intersection, i.e., U1 ∩ U4 = {u1 }.

Conjugate of compact vertices. If edge (v1 v2 ) ∈ E, then v2 is an imme-


diate neighbor of v1 and v1 is an immediate neighbor of v2 . Just as a vertex
has an immediate neighbor in a graph, we define such a notion for a compact
vertex and call it the conjugate.
For attributes a and x, let L ⊂ Va . Then
conjx (L) = {v ∈ Vx | for each v ′ ∈ L (vv ′ ) ∈ E}. (12.1)
Thus
1. conjx (L) is a conjugate of L, and
2. L is a conjugate of conjx (L).
Further, L has at most D nonempty conjugates.
The conjugates can also be seen from the perspective of a compact motif
as follows. Let U1 and U2 be compact vertices in a compact motif
C(VC , EC ),
with
(U1 U2 ) ∈ EC .
Then we call
1. U1 to be a conjugate of U2 and
2. similarly U2 to be a conjugate of U1 .
This leads to the following:
For each L1 ∈ LU1 there is at least one L2 ∈ LU2 such that
for each pair v1 ∈ L1 , v2 ∈ L2 , (v1 v2 ) ∈ E.
Similarly for each L2 ∈ LU2 there is at least one L1 ∈ LU1 such that
for each pair v1 ∈ L1 , v2 ∈ L2 , (v1 v2 ) ∈ E.

Then L1 is a conjugate of L2 and L2 is a conjugate of L1 .
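
Equation (12.1) translates directly into code. The sketch below is illustrative
only; the attribute and adjacency maps are those of the first component of the
running example in Figure 12.16(a), written out here as an assumption for the
demonstration.

def conj(L, x, att, adj):
    # conj_x(L): vertices of attribute x adjacent to *every* vertex in L.
    return {v for v in att if att[v] == x and all(u in adj[v] for u in L)}

att = {'v1': 'red', 'v2': 'blue', 'v3': 'green', 'v4': 'red', 'v5': 'red'}
adj = {
    'v1': {'v2', 'v3'}, 'v2': {'v1', 'v3'},
    'v3': {'v1', 'v2', 'v4', 'v5'}, 'v4': {'v3'}, 'v5': {'v3'},
}
print(conj({'v4', 'v5'}, 'green', att, adj))  # {'v3'}
print(conj({'v1', 'v4'}, 'green', att, adj))  # {'v3'}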

Conjugate of location lists. For an attribute x, the conjugate list of a


given list L is defined as follows (using Equation (12.1)):
Conjx (L) = {conjx (L) | L ∈ L}. (12.2)
Note that L has at most D nonempty conjugates.
The following is a critical property of the conjugate list that we exploit in
the algorithm to discover the compact (maximal) motifs. This fact can be
verified (using proof by contradiction) and we leave the proof as an exercise
for the reader.

THEOREM 12.1
(Conjugate maximality theorem) The conjugate of a maximal location
list is maximal.

Continuing example. Consider motif 1 of the example in Figure 12.13(b).

Conjgreen (LU ) = {{v3 }, {v8 }, {v8 }, {v8 }}


= {{v3 }, {v8 }}
= {v3 , v8 }. (12.3)

The conjugate of the compact list is shown below:

Conjgreen (LU ) = {{v3 }, {v8 }}


= {v3 , v8 }.

Conjugate notation. In an implementation, the conjugate relation is stored


as pointers. However in the description here, we show the conjugate relation
of the list as m, and that of each of its elements as l. For example,

LU = {{v4 , v5 }, {v9 , v10 , v11 }}


m l l
Lu2 = {v3 , v8 } (12.4)
m l l
Lu1 = {v2 , v7 }

Note that in this example the following hold:

1. Conjred (Lu2 ) = LU and Conjblue (Lu2 ) = Lu1 ,

2. Conjgreen (LU ) = Lu2 , and

3. Conjgreen (Lu1 ) = Lu2 .

Multiplicity in (compact) location lists. Consider the location list given


in Equation (12.3). Here we have replaced three instances of v8 , as determined
by the conjugates, with just one. In the rest of the treatment, we ignore such
multiplicities of the elements of the location list. Note that there is no loss of
information. For instance, consider the first two lists in Equation (12.4) along
with the conjugacy relations:

LU = {{v4 , v5 }, {v9 , v10 , v11 }}


m l l
Lu2 = {v3 , v8 }

This implicitly implies the following conjugacies:

LU = {{v4 , v5 }, {v9 , v10 }, {v9 , v11 }, {v10 , v11 }}


m l l l l
Lu2 = {v3 , v8 , v8 , v8 }

12.4.7 Characteristics of compact lists


Given a graph G(V, E) and a maximal motif
M (VM , EM )
on this graph, let
U (⊆ VM )
be a set of maximal indistinguishable vertices (w.r.t. some U ′ ⊂ VM ); then the
compact location list of U is written as: 4
LU = {L1 , L2 , . . . , Lℓ } ⊂ 2V .
Then the five characteristics of LU are as follows:
1. Let

d = min_{1≤i≤ℓ} |Li |.
If for some L′ ∈ L,
|L′ | = d,
then L′ is called a discriminant of L. We write
d = discSz(L).

2.
f lat(L) = ∪L∈L L.

3. Expansion of L, Exp(L) ⊂ 2V is given as:


Exp(L) = {L ∈ 2V | there exists some Li ∈ L with L ⊂ Li and |L| = d}.

4. All the vertices in f lat(L) have the same attribute given by att(L).
5. For 1 ≤ l ≤ ℓ, let
ELl = {(v1 v2 ) ∈ E | v1 , v2 ∈ Ll }.
Then
GLl (Ll , ELl )
is called the induced subgraph on Ll . Further, if for each u1 , u2 ∈ Ll ,
(u1 u2 ) ∈ ELl , then the graph GLl is called a clique.
clq(L) is an indicator that is set 1 if all the ℓ induced subgraphs are
cliques. Formally,

clq(L) = 1 if, for each L ∈ L, GL (L, EL ) is a clique, and clq(L) = 0 otherwise.

4 For example, compact list LU = {{v1 , v2 , v3 }, {v4 , v5 }} is written as LU = {L1 , L2 }, where
L1 = {v1 , v2 , v3 } and L2 = {v4 , v5 }.
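
Four of the five characteristics (the expansion was sketched in Section 12.4.3)
are simple set computations. The sketch below is illustrative; helper names are
chosen here for exposition, and the adjacency is restricted to exactly what the
check needs.

from itertools import combinations

def disc_sz(lists):            # size of a discriminant of L
    return min(len(L) for L in lists)

def flat(lists):               # flat(L): union of all members
    return set().union(*lists)

def attribute(lists, att):     # att(L): the common attribute of flat(L)
    (a,) = {att[v] for v in flat(lists)}
    return a

def clq(lists, adj):           # 1 iff every member induces a clique
    def is_clique(L):
        return all(v in adj[u] for u, v in combinations(L, 2))
    return int(all(is_clique(L) for L in lists))

# L_{U1} of the continuing example: discSz 2, attribute red, clq 0.
L_U1 = [{'v4', 'v5'}, {'v9', 'v10', 'v11'}]
att = {v: 'red' for v in flat(L_U1)}
adj = {'v4': {'v3'}, 'v5': {'v3'}, 'v9': {'v8'}, 'v10': {'v8'}, 'v11': {'v8'}}
print(disc_sz(L_U1), attribute(L_U1, att), clq(L_U1, adj))   # 2 red 0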

Continuing example. We compute the five characteristics for the example


of the motif in Figure 12.13(b). Note that

U1 = {u4 , u5 }

is a maximal set of indistinguishable vertices w.r.t.

U2 = {u3 }

and vice-versa. The compact location lists of U1 and U2 along with their
characteristics is given below.

(a) LU1 = {{v4 , v5 }, {v9 , v10 , v11 }}.

1. Discriminant of LU1 is {v4 , v5 } and thus discSz(LU1 ) = 2.


2. f lat(LU1 ) = {v4 , v5 , v9 , v10 , v11 }.
3. Exp(LU1 ) = {{v4 , v5 }, {v9 , v10 }, {v10 , v11 }, {v9 , v11 }}.
4. att(LU1 )= red,
5. clq(LU1 ) = 0.

(b) LU2 = {{v3 }, {v8 }} or simply Lu3 = {v3 , v8 } with att(LU2 )= green.

Next consider the motif in Figure 12.13(c).

U3 = {u3 , u4 , u5 }

is a maximal set of indistinguishable vertices w.r.t.

U4 = {u2 }

and vice-versa. The compact location lists of U3 and U4 are given as:

(a) LU3 = {{v1 , v4 , v5 }, {v9 , v10 , v11 }}.

1. Discriminant of LU3 is {v1 , v4 , v5 } and so is {v9 , v10 , v11 }. Thus


discSz(LU3 ) = 3.
2. f lat(LU3 ) = {v1 , v4 , v5 , v9 , v10 , v11 }.
3. Exp(LU3 ) = LU3 .
4. att(LU3 )= red,
5. clq(LU3 ) = 0.

(b) LU4 = {{v3 }, {v8 }} or simply Lu2 = {v3 , v8 } with att(LU4 )= green.

12.4.8 Maximal operations on compact lists


Once we recognize that a compact list is merely a concise notation for its
expansion, the following is easy to see.
1. L1 =c L2 if and only if Exp(L1 ) = Exp(L2 ).
In the rest of the chapter we use ‘=’ also as an assignment symbol. Thus
when we say L1 = L2 , we mean that list L1 is assigned to be the list
L2 .
2. L1 ⊂c L2 if and only if
for each L1 ∈ L1 , there exists some L2 ∈ L2 such that L1 ⊂ L2 .
3. The intersection of two compact lists is written as

L 3 = L 1 ∩c L 2 ,

and is defined as follows:


v1 , v2 ∈ L3 (∈ L3 )
if and only if v1 , v2 ∈ L1 , L2 for some L1 ∈ L1 and L2 ∈ L2 .
The intersection of p compact lists,

L 1 ∩c L 2 ∩c . . . ∩c L p ,

can be generalized from intersection of two lists.


The following can be verified and is left as an exercise for the reader.

f lat(L3 ) = f lat(L1 ) ∩ f lat(L2 ).

4. The union of two compact lists is written as

L 3 = L 1 ∪c L 2 ,

and is defined as follows:


 
L3 ∈ L3 if and only if L3 = L1 ∪ L2 , for some L1 ∈ L1 and L2 ∈ L2 with
L1 ⊂ L2 or L2 ⊂ L1 .

The union of p compact lists,

L 1 ∪c L 2 ∪c . . . ∪c L p ,

can be generalized from union of two lists.


The following can be verified and is left as an exercise for the reader.

f lat(L3 ) = f lat(L1 ) ∪ f lat(L2 ).



5. The difference of two compact lists is written as

L3 = L1 \c L2 ,

and is defined as follows:


v1 ∈ L3 (∈ L3 ) if and only if
1. v1 , v2 ∈ L1 and
2. v2 ∈ L2 , but v1 ∉ L2 ,
for some L1 ∈ L1 and L2 ∈ L2 .
However, when
L2 = L1 ∩c L′1 ,
for some L′1 , then,

L3 = L1 \c L2
= L1 \c (L1 ∩c L′1 )
= {L1 \ L2 | L1 ∈ L1 , L2 ∈ L2 and L1 ∩ L2 ≠ ∅}.

Note that in general, given maximal compact lists L1 and L2 ,

L1 \c L2

is not necessarily maximal.
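
As a concrete reading of the intersection rule in item 3, one straightforward
(though not necessarily the most efficient) realization builds L1 ∩c L2 from the
nonempty pairwise intersections of the members, keeping only the maximal ones.
The sketch below is illustrative only.

def compact_intersection(L1, L2):
    # nonempty pairwise intersections of the members ...
    raw = {frozenset(A & B) for A in L1 for B in L2 if A & B}
    # ... keeping only members not strictly contained in another member.
    return [set(A) for A in raw if not any(A < B for B in raw)]

L1 = [{'v1', 'v4', 'v5'}, {'v9', 'v10', 'v11'}]
L2 = [{'v1', 'v2', 'v6'}, {'v9', 'v10'}]
print(compact_intersection(L1, L2))   # two members: {'v1'} and {'v9', 'v10'}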

12.4.9 Maximal subsets of location lists


A location list can have a large number of subsets, but which subsets are
maximal? Our interest is only in these specific subsets. We define two ways
of generating new lists given one location list L.
1. Let

d = discSz(L) and
dmax = max (|Li |) .
Li ∈L

Enrich(L) is a set of lists defined as


 
Enrich(L) = { L(p) | p = |Li | for some Li ∈ L, and d ≤ p ≤ dmax },
where
L(p) = {Li ∈ L | |Li | ≥ p} .
Thus new lists are generated only when

dmax > d > 1,



and for each such newly generated list L ∈ Enrich(L),

discSz(L) > d.

2. Let
Li (⊂ V ),
where all the vertices have the same attribute. If

U ⊂ Li

induces a clique,5 then it is also a maximal clique if there is no U ′ such


that
U ⊊ U ′ ⊆ Li
that also induces a clique on the input graph.
Let
maxClk(Li ) = {L1i , L2i , . . . , Lki i }
where each Lji , 1 ≤ j ≤ ki , induces a maximal clique. Then clique(L)
is a list defined as:

clique(L) = {Lji | Lji ∈ maxClk(Li ) and Li ∈ L}.

Clearly
clq(clique(L)) = 1.

Note that Enrich(L) results in possibly more than one new list whereas
clique(L) results in at most one new list.
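
The Enrich(·) operation is a simple filtering by member size. The sketch below
is illustrative only; it reproduces, for the list L5 used later in Figure 12.18, the
single new list L′5 = {{v9 , v11 }}.

def enrich(lists):
    # For every member cardinality p, keep the sub-list of members of size >= p;
    # only p larger than discSz(L) yields genuinely new lists.
    sizes = sorted({len(L) for L in lists})
    return {p: [L for L in lists if len(L) >= p] for p in sizes}

L5 = [{'v4'}, {'v5'}, {'v9', 'v11'}, {'v10'}]
for p, Lp in enrich(L5).items():
    print(p, Lp)
# 1 [{'v4'}, {'v5'}, {'v9', 'v11'}, {'v10'}]
# 2 [{'v9', 'v11'}]    <- the new list L'_5 of Figure 12.18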
We next make another observation that is important for the algorithm that
is discussed in the later sections. It states that conjugate lists can be computed
by simply taking appropriate subsets of other known compact lists rather than
using the original graph G(V, E) as in Equation (12.1).

LEMMA 12.1
(Intersection conjugate lemma) For lists

L1 , L2 , . . . , Lp

let the conjugates for an attribute a be

L1a , L2a , . . . , Lpa .

For
1 ≤ j ≤ p,

5 See Section 12.4.7 for the definition of clique.



define relations Rj as follows:



(Lj , Laj ) ∈ Rj ⇔ Lj is a conjugate of Laj , for Lj ∈ Lj and Laj ∈ Laj .
Next, let
L ′ = L 1 ∩c L 2 ∩c . . . ∩c L p .
Then the conjugate, L′a , of each element
L′ ∈ L′

is given as follows:6

L′a = ∪j ∪i Laji , where L′ ⊂ Lji ∈ Lj and (Lji , Laji ) ∈ Rj .

Further, if each of the following is maximal (in which case L′ is also maximal):


L1 , L2 , . . . , Lp ,
L1a , L2a , . . . , Lpa ,
then this conjugate, L′a , is the same set given by
conja (L′ ) (of Equation (12.1))
that is obtained directly using the input graph G(V, E).

COROLLARY 12.1
(Subset conjugate lemma) For an attribute a, let La be a conjugate of L.
Define a relation R as follows:

(L, La ) ∈ R ⇔ L is a conjugate of La , for L ∈ L and La ∈ La .
Let
L′ ⊂c L.
Then the conjugate, L′a , of each element L′ ∈ L′ is given as follows.
L′a = ∪i Lai , where L′ ⊂ Li ∈ L and (Li , Lai ) ∈ R.

Further, if L, La and L′ are maximal, then this conjugate, L′a , is the same
set given by
conja (L′ ) (of Equation (12.1))
that is obtained directly using the input graph G(V, E).

6 The two ‘unions’ here are due to the fact that in the intersection set L′ , more than one

element (denoted here as Lji ) of Lj may be involved.



12.4.10 Binary relations on compact lists


We study some binary relations on compact lists that will be used in the
following section to construct the compact motifs. Recall that binary relations
can be captured by a graph.
1. (complete intersection) An intersection of L1 of L2 is called a complete
intersection if
L1 ∩c L2 is a complete subset of L1 and L2 .
Note that L′ ⊂c L is a complete subset if
for each L ∈ L there is L′ ∈ L′ such that L′ ⊂ L.
2. (compatible) L1 and L2 are compatible if
(a) L1 ∩c L2 = ∅, or,
(b) if L1 ∩c L2 6= ∅, then the intersection is complete.
3. (incompatible) If L1 and L2 are not compatible they are called incom-
patible.
4. (conjugate) We have already seen the condition when L1 is a conjugate
of L2 (see Equation (12.2)).

12.4.11 Compact motifs from compact lists


How are compact motifs computed from compact lists? We define a meta-
graph
G(L, E)
on a collection of compact lists L where the labeled edges are defined as
follows. Figure 12.15 shows a concrete example.
1. With a slight abuse of notation, we call a compact list
L∈L
a vertex in this meta-graph.
2. The edges in the meta-graph are of three types:7
(a) If L2 is a conjugate of L1 then
(L1 L2 ) ∈ E,
and the edge type is ‘link’. In Figure 12.15(a) this is shown as a
regular solid edge.

7 To avoid confusion with attributes on vertices we call this the ‘edge type’ instead of ‘edge
attribute’.

2 1

4 5
3
(a) Meta-graph G(L, E)

2 1 2

5 4 5
3 3
C1 (VC1 , EC1 ) C2 (VC2 , EC2 )

(b) The two maximal connected consistent subgraphs.

u1 u4 u5
u5 u4
u3
u2 u3 u1 u2

M1 (VM1 , EM1 ) M2 (VM2 , EM2 )

(c) Maximal motifs.

FIGURE 12.15: (a) A meta-graph where the ‘forbidden’ edge is shown
in bold (connects an inconsistent pair), the ‘subsume’ edge is shown dashed
(connects a complete intersection pair) and the remaining edges are ‘link’
edges (denoting the conjugacy relation). The correspondence between the
node numbers shown here to the location lists shown in Figure 12.17 are as
follows: 1 ↔ L1 , 2 ↔ L2 , 3 ↔ L3 , 4 ↔ L4 , 5 ↔ L4 \c L1 . The singleton lists
are not shown here. (b) and (c) show the MCCSs and the maximal motifs
respectively.

v1 v4 v9 v1 v4 v9
v6 v10 v6 v10
v5 v5
v11 v11
v2 v3 v7 v8 v2 v3 v7 v8
(a) Input graph G1 . (b) Input graph G2 .

B1 B2

v1 v2 v3 v1 v2 v3
v4 v3 v4 v5 v3
v5 v3 v5 v4 v3
v6 v7 v6 v7
v9 v8 v9 v10 v8
v10 v8 v10 v9 v11 v8
v11 v8 v11 v10 v8
v2 v1 v3 v2 v1 v3
v7 v6 v8 v7 v6 v8
v3 v1 v4 v5 v2 v3 v1 v4 v5 v2
v8 v9 v10 v11 v7 v8 v9 v10 v11 v7

(c) Adjacency matrix B1 for G1 . (d) Adjacency matrix B2 for G2 .

FIGURE 12.16: The two running examples G1 and G2 with their adja-
cency matrices. Each graph has two connected components.

Initialization (Linit )
L1 = {v1 , v6 }
m l l
L2 = {v2 , v7 }
m l l
L3 = {v3 , v8 }
m l l
L4 = {{v1 , v4 , v5 }, {v9 , v10 , v11 }}
L1 and L4 are inconsistent.

Iterative Step
L4 \c L1 = {{v4 , v5 }, {v9 , v10 , v11 }}
m l l
L3 = {v3 , v8 }
L1 and L4 \c L1 are consistent.

L 1 ∩c L 4 = {v1 }
m l
L′2 = {v2 }
m l
L′3 = {v3 }
m l
L′4 = {{v1 , v4 , v5 }}
L1 ∩c L4 and L′4 are consistent.

L1 \c L4 = {v6 }
m l
L′2 = {v7 }
m l
L′3 = {v8 }
m l
L′4 = {{v9 , v10 , v11 }}
L1 \c L4 and L′4 are consistent.

Motifs with quorum K = 2.


u1 u4 u5
u5 u4
u3
u2 u3 u1 u2
(a) (b)

FIGURE 12.17: The solution for the input graph shown in Fig-
ure 12.16(a). See text for details.

Initialization (Linit )
L1 = { v1 , v6 }
m l l
L2 = { v2 , v7 }
m l l
L3 = { v3 , v8 }
m l l
L4 = { {v1 , v4 , v5 }, {v9 , v10 , v11 } }
L1 and L4 are inconsistent.

L5 = {v4 , v5 , {v9 , v11 }, v10 }


m l l l l
L5 = {v5 , v4 , v10 , {v9 , v11 }}

L6 = { {v4 , v5 }, {v9 , v10 }, {v10 , v11 } }


L5 is replaced by L6 .

Iterative Step
L4 \c L1 = { {v4 , v5 }, {v9 , v10 , v11 } }
m l l
L3 = { v3 , v8 }
L1 and L4 \c L1 are consistent.

clique(L4 ) =
L6 = { {v4 , v5 }, {v9 , v10 }, {v10 , v11 }}
m l l l
L3 = { v3 , v8 , v8 }
L1 and L6 are consistent.

Motifs with quorum K = 2.


u1 u4 u5
u5 u4
u3
u2 u3 u1 u2
(a) (b)

FIGURE 12.18: The solution for the input graph shown in Fig-
ure 12.16(b). Note that singleton location lists are not shown. See text for
details.

(b) Let L1 and L2 belong to a connected component (of link edges) of


the meta-graph.
If L1 and L2 are compatible then they are called consistent and if
they are incompatible, they are called inconsistent. If L1 and L2
are inconsistent, then
(L1 L2 ) ∈ E,
and the edge type is ‘forbidden’. In Figure 12.15(a) this is shown
as a bold edge and this is always between vertices (location lists)
that have the same attribute.
(c) If
L2 = L1 \c L3 ,
for some L3 , then
(L1 L2 ) ∈ E,
and the edge type is ‘subsume’. In Figure 12.15(a) this is shown as
a dashed edge and the edge is also always between vertices (location
lists) that have the same attribute.

A connected subgraph of G(L, E) that has a pair of inconsistent vertices is


called an inconsistent subgraph. A subgraph that has no inconsistent pair is
called a consistent subgraph. Why is this property important? 8
Rationale: If the compact vertices form a cycle (of link edges) on the meta-
graph, then these are equivalent to cycles in the motif and then it is important
to check that the cycles occur in each occurrence of the motif (by checking
the location list).
In the case that the cycle is formed only in some but not all occurrences,
then we remove the inconsistency by computing new lists

L1 \c L2 and L2 \c L1 ,

which excludes these common (offending) vertices. It is easy to see that for
an inconsistent pair L1 and L2 ,

1. the two must lie on a cycle (of link edges) in the meta-graph G(L, E)
and

2. att(L1 ) = att(L2 ).

8 Most patternists (scientists who specialize in pattern discovery in data) with an honest

regard for mathematics, shudder at the thought of patterns in graphs because of the sheer
promiscuity that the vertices display in terms of how many neighbors (partners) each can
sustain. We also get no respite from the consequences of this. So, it is not surprising that
both graph-theoretic and set-theoretic tools are used for this (that of connectedness in the
meta-graph and complete intersection of compact lists).

Before we get down to the task of computing the compact motifs, we sum-
marize the important observations as a theorem.

THEOREM 12.2
(Maximal begets maximal theorem) If L1 and L2 are maximal, then the
following statements hold.

1. Each conjugate list of L1 is maximal.

2. Each list in Enrich(L1 ) is maximal.

3. clique(L1 ) is maximal.

4. L1 ∩c L2 is maximal.

5. L1 \c L2 and L2 \c L1 are maximal, if L1 and L2 are inconsistent.

6. Li \c (L1 ∩c L2 ) is not necessarily maximal for i = 1, 2.

The following theorem summarizes the observation and is central to extracting


the maximal motifs.

THEOREM 12.3
(Compact motif theorem) Given an input graph, let its meta-graph be
given as
G(L, E).

A subgraph
C(VC , EC )

is a maximal connected consistent subgraph (MCCS) of the meta-graph, if it


satisfies the following:

1. (connected) For any two vertices in VC there is a path between the two
where each edge is of type ‘link’.

2. (consistent) For no vertices L1 , L2 ∈ VC is the edge

(L1 L2 ) ∈ EC

of the type ‘forbidden’.

3. (maximal) No more vertices can be added to VC satisfying the above two


conditions.

Then,
C(VC , EC )
defines a (maximal) compact motif on input graph G(V, E).

The construction of all the MCCSs in a meta-graph is discussed in Exer-


cise 153. The construction of a maximal motif from each MCCS is discussed
below.

MCCS to compact motif. Let the MCCS be given by

C(VC , EC ).

We wish to compute
M (VM , EM )
from C(VC , EC ). See Figure 12.14 for the notation and concrete examples.
For a location list Li ∈ VC , let

att(Li ) = ai and
discSz(Li ) = di .

Then the compact vertex Ui corresponding to Li is specified as

Ui = {ui1 , ui2 , . . . , uidi } ⊂ VM ,

and for each 1 ≤ j ≤ di ,



att(uij ) = att(Ui ).

Also if
clq(Li ) = 1,
then an edge is introduced in the motif between every pair of vertices, i.e., for
each 1 ≤ j < k ≤ di ,
(uij uik ) ∈ EM .
However, if the edge type of (Li Lj ) is ‘subsume’, i.e., one is a complete subset
of the other, then without loss of generality let

di ≤ dj ,

and di vertices in Ui and Uj get the same labels, i.e.,

Ui = {ui1 , ui2 , . . . , uidi } ⊂ VM ,
Uj = {ui1 , ui2 , . . . , uidi , ui(di +1) , ui(di +2) , . . . , uidj } ⊂ VM .

This is demonstrated in Figure 12.15(b)-(c). Note that the node marked


by 5 (L4 \c L1 ) is a complete subset of the node marked by 4 (L4 ), leading to
the labeling of the nodes in Figure 12.15(c). Notice that for all practical
purposes, node 5 may be removed (since it is subsumed by node 4).
Given an input graph G(V, E), we discuss in the next section how to com-
pute the set of all maximal lists L.

What’s next?
Given an input graph G(V, E) and a quorum K, we have stated the need
to discover all the maximal topological motifs (and their occurrences in the
graph) that satisfy this quorum constraint. We have then painstakingly con-
vinced the reader that we want these maximal motifs to be in the compact
form. The occurrences of these compact motifs on the input graph is described
by compact lists.9
Now that we know ‘what’ we want, we must next address the question of
‘how’. Thus the next natural step is to explore methods to compute these
compact motifs and lists from the input.

12.5 The Discovery Method


Main idea. The method is based on a simple observation: given a graph,
a topological motif can be represented either as a motif (graph) or as a
collection of location lists of the vertices of the motif. By taking a frequentist
approach to the problem, the method discovers the multiply occurring motifs
by working in the space of location lists. There are two aspects that lend
themselves to efficient discovery:

1. The motifs are maximal, so a relatively small number of possibilities are


to be explored.
For instance if a graph has exactly three red vertices with 5, 8 and
10 blue immediate neighbors (along with other edges and vertices with
other colors), then any maximal motif with a red vertex can have exactly
5 blue neighbors or exactly 8 blue neighbors, although a subgraph could

9 Note however that the nature of the beast is such that even compact lists may not save

the day: see Exercise 154 based on the example of Figure 12.4.

have from 0 to 10 blue neighbors. The compact location list captures


this succinctly.
2. A single conservative intersection operation handles all the potential
candidates with a high degree of efficiency.

Method overview. The discovery process consists of mainly two steps. In


the first step an exhaustive list of potential location lists of vertices of these
motifs is computed, which is stored in a compact form, as compact location
lists. In the second step this collection of compact location lists computed in
the first step is enlarged by including all the nonempty intersections, amongst
the location lists computed in the first step. The collection of compact location
lists so obtained has the nice property that every compact (maximal) motif
has a compact vertex whose location list appears in compact form in this
collection. Conversely, every compact location list appearing in this collection
is the location list of some compact vertex of a compact (maximal) motif. The
intersection operations at different stages are carried out in an output-sensitive
manner: this is possible since we are computing maximal intersections.

Input: The input is a graph G(V, E) where an attribute is associated with


each vertex. For the sake of convenience, let B be a two-dimensional array of
dimension |V | × D, where D is the total number of distinct attributes, and,
which encodes the graph as follows (1 ≤ i ≤ |V |, 1 ≤ j ≤ D):
B[i][j] is the set of vertices adjacent to vi having the attribute aj .
B is called the adjacency matrix. Two examples are illustrated in Figure 12.16.
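
The array B is straightforward to build in one pass over the edges. The sketch
below is illustrative only; the data layout is an assumption, and the input is the
first connected component of G1 in Figure 12.16(a).

from collections import defaultdict

def build_B(edges, att):
    # B[v][a] is the set of neighbors of vertex v carrying attribute a.
    B = defaultdict(lambda: defaultdict(set))
    for u, v in edges:
        B[u][att[v]].add(v)
        B[v][att[u]].add(u)
    return B

att = {'v1': 'red', 'v2': 'blue', 'v3': 'green', 'v4': 'red', 'v5': 'red'}
edges = [('v1', 'v2'), ('v1', 'v3'), ('v2', 'v3'), ('v3', 'v4'), ('v3', 'v5')]
B = build_B(edges, att)
print(sorted(B['v3']['red']))    # ['v1', 'v4', 'v5']  (matches row v3 of B1)
print(sorted(B['v2']['green']))  # ['v3']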
Output: All compact maximal motifs
C(Vc , Ec ) along with LU for each U ∈ VC .

12.5.1 The algorithm


The algorithm works by first generating a small collection of maximal lists
and their conjugates (which are also maximal),
Linit ,
called the initialization step. In the iterative step, more maximal lists are care-
fully added to this initial list through the maximal set generating operations
(such as set intersections ∩c , set difference \c , conjugate conj(·), Enrich(·))
described in the earlier sections.

12.5.1.1 Initialization: generating Linit


At the very first step we compute a collection of compact lists Linit . This
collection of maximal lists is characterized as follows:
L ∈ Linit , if and only if there exists no maximal L′ such that L ⊂c L′ .

Recall that we have defined maximality of a location list in terms of the


maximal motifs. Then, in the absence of the output motifs, how do we know
which list is maximal?
For a pair of attributes, ai , aj , two location lists Lai and Laj are constructed
where

att(Lai ) = ai ,
att(Laj ) = aj ,
Conjai (Laj ) = Lai ,
Conjaj (Lai ) = Laj .

Ensuring the maximality of these lists is utterly simple since it can almost
be read off the adjacency matrix B. This is best understood by following a
simple concrete example. We show two such examples in Figure 12.16. But
we must also avoid overcounting, as discussed in the following paragraphs.
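
The read-off can be phrased as follows (this is our reading of the worked example,
not a verbatim procedure from the text): for an attribute pair (ai , aj ), every vertex
w of attribute aj with at least one ai -neighbor contributes the member B[w][ai ]
to the list of attribute ai , and {w} to its conjugate.

def init_pair(B, att, ai, aj):
    L_i, L_j = [], []
    for w in B:
        if att[w] == aj and B[w].get(ai):
            L_i.append(set(B[w][ai]))   # a member of the compact list for a_i
            L_j.append({w})             # its conjugate member for a_j
    return L_i, L_j

att = {'v1': 'red', 'v2': 'blue', 'v3': 'green', 'v4': 'red', 'v5': 'red'}
B = {
    'v1': {'blue': {'v2'}, 'green': {'v3'}},
    'v2': {'red': {'v1'}, 'green': {'v3'}},
    'v3': {'red': {'v1', 'v4', 'v5'}, 'blue': {'v2'}},
    'v4': {'green': {'v3'}},
    'v5': {'green': {'v3'}},
}
print(init_pair(B, att, 'red', 'green'))
# ([{'v1', 'v4', 'v5'}], [{'v3'}])  -- L4 and L3 of Figure 12.17, first component
print(init_pair(B, att, 'red', 'blue'))
# ([{'v1'}], [{'v2'}])              -- L1 and L2, first component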

Avoiding multiple counting. We combine maximality with another prac-


tical consideration and that is of avoiding multiple reading of the same occur-
rence. This is achieved by imposing the following constraint on the location
lists L1 and its conjugate L2 :

If f lat(L1 ) = f lat(L2 ),
then discSz(L1 ) must be different from discSz(L2 ).

To understand this constraint, consider the following scenario. Consider a


graph with a single edge

G({v1 , v2 }, {(v1 v2 )})

and let quorum K = 2. Further, let att(v1 ) = att(v2 ) = a, some fixed


attribute. A motif
M ({u1 , u2 }, {(u1 u2 )})
with
Lu1 = {v1 , v2 }
m l l
Lu2 = {v2 , v1 }
satisfies the quorum condition since v1 can be considered either as the start or
end vertex of the edge and similarly v2 can be considered either as the start
or end vertex of the edge, giving two apparent occurrences.
Thus it is important to avoid ‘over-counting’. But, first we compute the
subsets using Enrich(·) and clique(·), along with their conjugates. Then to
avoid the over-counting, we replace both L1 and its conjugate L2 that have
the same flat sets and the same size of discriminant with a single list L3 that
denotes all the edges of the conjugate relationship. It is made maximal, by

recognizing the maximal cliques on the vertices of f lat(L3 )(= f lat(L1 ) =


f lat(L2 )). Consider the example in Figure 12.18.

L5 = {v4 , v5 , {v9 , v11 }, v10 }


m l l l l
L5 = {v5 , v4 , v10 , {v9 , v11 }}

We first compute the maximal subset L′5 = Enrich(L5) and compute its
conjugate directly as a subset of L5 as follows:

L′5 = {{v9 , v11 }}


m l
L′′5 = {v10 }

Then to avoid multiple counting, L5 and its conjugate L5 give

L6 = {{v4 , v5 }, {v9 , v10 }, {v10 , v11 }}

with

att(L6 ) = red, discSz(L6 ) = 2, clq(L6 ) = 1.

Note that L6 , L′5 and L′′5 belong to Linit .

Back to concrete example. Consider the example G1 in Figure 12.16(a).


For each pair of attributes, the maximal lists are simply ‘read off’ the adja-
cency matrix B1 as follows where the conjugacy relationship of two lists is
shown by the ‘⇔’ symbol:

red blue green

red X L1 ⇔ L2 L4 ⇔ L3
blue - X L2 ⇔ L3
green - - X

Thus
Linit = {L1 , L2 , L3 , L4 }
and these lists, along with the conjugacy relations, are shown in Figure 12.17.
Next, consider the example G2 in Figure 12.16(b). Again, for each pair of
attributes, the maximal lists are simply ‘read off’ the adjacency matrix B2 as
follows:
red blue green

red L5 ⇔ L5 L1 ⇔ L2 L4 ⇔ L3
blue - X L2 ⇔ L3
green - - X

However, here to avoid over-counting, we replace L5 ⇔ L5 by L6 as shown,


which represents the edges (encoded by L5 ⇔ L5 ). Also, notice that it has no
conjugate list. But before replacing with L6 , we compute Enrich(L5 ) to get L′5 and
collect its conjugate L′′5 . Thus

Linit = {L1 , L2 , L3 , L4 , L′5 , L′′5 , L6 }

and these lists, along with the conjugacy relations, are shown in Figure 12.18.

12.5.1.2 The iterative step


For each new compact (maximal) set L, possible new maximal compact
lists (say L′ ) are generated
1. using the Enrich(·) operations,
2. using the clique(·) operation, and,
3. set difference (\c operation) if there are inconsistencies in the meta-
graph G.
Note that each
L′ ⊂c L,
and we generate the conjugates of L′ using the conjugates of L.
Also, a collection of p maximal compact lists are used to generate new lists.
How do we choose the value of p? And, which collection of p maximal lists
do we choose? This is done in the most conservative way using the Refine(·)
procedure. It is the process of computing new maximal location lists through
compact list intersections. It is formally defined as follows.

Problem 18 (Refine(C)) The input to the problem is a collection of n com-


pact sets C = {C1 , C2 , . . . , Cn }. For a compact set S such that

S = C i1 ∩ c C i2 ∩ c . . . ∩ c C ip ,

we denote by
IS = {i1 , i2 , . . . , ip }.
Further, IS is maximal i.e., there is no I ′ with

IS ⊊ I ′ ,

such that

S = Cj1 ∩c Cj2 ∩c . . . ∩c Cjp′ , where I ′ = {j1 , j2 , . . . , jp′ }.

The output is the set of all pairs (S, IS ).


Given a collection of n compact sets C = {C1 , C2 , . . . , Cn }, we propose a
three step solution to this problem.

1. Obtain C′ , a set of n flat sets from the given collection


C′ = {f lat(Ci ) | Ci ∈ C}.

2. We solve the maximal set intersection problem for C′ (which is defined


exactly as Refine(·) except that the sets are flat, thus ∩ instead of ∩c
is used).
3. For each solution (a flat set) computed in the last step, we reconstruct
a compact set. This is defined by the following problem.

Problem 19 Given p compact lists, L1 , L2 , . . . Lp , let


L′ = f lat(L1 ) ∩ f lat(L2 ) ∩ . . . ∩ f lat(Lp ).
Using L′ , compute LX given as
L X = L 1 ∩c L 2 ∩c . . . ∩c L p .

We solve this problem by constructing a neighborhood graph GN (L′ , EN ).


Again, we abuse notation slightly, and each element of L′ is (mapped
to) a vertex on this neighborhood graph. Further,
EN = {(v1 v2 ) | for each 1 ≤ i ≤ p, v1 , v2 ∈ L for some L ∈ Li }.
Then LX is obtained as follows:
LX = {L | L ⊂ L′ is a maximal clique on GN (L′ , EN )} .
See Section 12.4.9 for a discussion on maximal cliques.
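
Problem 19 can be solved, for small instances, exactly as described: join two
vertices of L′ when they co-occur in some member of every Li , and report the
maximal cliques of the resulting neighborhood graph. The sketch below uses a
brute-force clique enumeration that is adequate only for tiny examples and is
given purely for illustration.

from itertools import combinations

def reconstruct(L_prime, lists):
    def joined(u, v):
        return all(any({u, v} <= L for L in Li) for Li in lists)
    adj = {u: {v for v in L_prime if v != u and joined(u, v)} for u in L_prime}
    cliques = []
    for r in range(len(L_prime), 0, -1):          # largest subsets first
        for cand in combinations(sorted(L_prime), r):
            S = set(cand)
            if all(v in adj[u] for u, v in combinations(S, 2)) \
               and not any(S < C for C in cliques):
                cliques.append(S)
    return cliques

L1 = [{'v1', 'v4', 'v5'}, {'v9', 'v10', 'v11'}]
L2 = [{'v4', 'v5'}, {'v9', 'v10'}, {'v10', 'v11'}]
Lp = set().union(*L1) & set().union(*L2)          # flat(L1) ∩ flat(L2)
print(reconstruct(Lp, [L1, L2]))
# e.g. [{'v10', 'v11'}, {'v9', 'v10'}, {'v4', 'v5'}]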

Putting it all together. These steps are carefully integrated into one co-
herent procedure outlined as Algorithm (12).
Here InitMetaGraph(L) is a procedure that generates the meta-graph given
the collection of lists, L, and their conjugates. See Section 12.4.11 for details
on detecting inconsistencies in a connected component of the meta-graph.
Induct(L, p, L1 , . . . , Lp ) is a procedure that introduces the new list L to
the collection. Further, L ⊂c L1 , . . . , Lp and the conjugates of these p lists
are used to compute the conjugates of L.
The remainder of the algorithm is self-explanatory. Figures 12.17 and 12.18
give the solution to the input graphs of Figure 12.16.
Consider Figure 12.17. See also Figure 12.15 for the meta-graph. In the
connected component (to avoid clutter, we show only the nodes)
(L1 , L2 , L3 , L4 , L4 \c L1 ),
L1 and L4 are inconsistent. Thus the two maximal subgraphs that are con-
sistent are

1. (L1 , L2 , L3 , L4 \c L1 ) and
2. (L2 , L3 , L4 , L4 \c L1 ).
Note that L4 \c L1 is a complete subset of L4 . Further,

(L1 , L2 , L3 )

is not maximal. The other two connected components

1. (L1 ∩c L4 , L′2 , L′3 , L′4 ) and


2. (L1 \c L4 , L′2 , L′3 , L′4 )
give motifs with quorum K < 2 (the topology of the two connected compo-
nents of the input graph).
Consider Figure 12.18. In the connected component

(L1 , L2 , L3 , L4 , L4 \c L1 , L6 ),

L1 and L4 are inconsistent. The two maximal consistent subgraphs are


1. (L1 , L2 , L3 , L4 \c L1 , L6 ), and
2. (L2 , L3 , L4 , L4 \c L1 , L6 ).
Notice that L6 is a complete subset of L4 .

Efficiency in practice. The collection of location lists L is partitioned by


attribute values att(L) for the Refine(·) operation, since clearly compact lists
with distinct attributes have empty intersections.
Also, if L is such that each L ∈ L is a singleton set, then the Enrich(L)
and clique(L) do not produce any new lists and these computations can be
skipped.

Algorithm 12 The Compact Motifs Discovery Algorithm

//input G(V, E)
DiscoverLists(L) //output L
{
Compute Linit
L ← Linit
InitMetaGraph(L)

Lnew ← Linit
WHILE (Lnew ≠ ∅)
FOR EACH L ∈ Lnew {
FOR EACH L′ ∈ Enrich(L)
Induct(L′ , 1, L)
L′ ← clique(L); Induct(L′ , 1, L)

}
L ← L ∪ Lnew
Refine(L)
FOR EACH (L,p, L1 , . . . , Lp ) computed in Refine
Induct(L,p, L1 , . . . , Lp ) // L is L1 ∩c . . . ∩c Lp
ENDWHILE
}

Induct(L,p, L1 , . . . , Lp )
{
IF L ≠ ∅ THEN
IF L ∉ L THEN
Add L to Lnew
Add vertex L to meta-graph
FOR 1 ≤ i ≤ p
IF Li and L are inconsistent THEN
Induct(L \c Li , 1, L); Induct(Li \c L, 1, Li )
FOR EACH conjugate (using the p lists) L′ of L
IF L′ ∉ L THEN
Add L′ to Lnew
Add vertex L′ & edge (LL′ ) to meta-graph
}

12.6 Related Classical Problems


Before we conclude the chapter, we relate the problem discussed here to
other classical problems. To summarize, the automated topological discov-
ery problem is abstracted as follows: Given an integer K(> 1) and a graph

G(V, E) with labeled vertices and edges, the task is to discover at least K
subgraphs that are topologically identical in G. Such subgraphs are termed
topological motifs.
It is very closely related to the classical subgraph isomorphism problem
defined as follows [GJ79]:

Problem 20 (Subgraph isomorphism) Given graphs G = (V1 , E1 ) and H =


(V2 , E2 ). Does G contain a subgraph isomorphic to H, i.e., a subset V ⊆
V1 and a subset E ⊆ E1 such that |V | = |V2 |, |E| = |E2 | and there exists
a one-to-one function f : V2 → V satisfying {v1 , v2 } ∈ E2 if and only if
{f (v1 ), f (v2 )} ∈ E?

Two closely related problems are as follows [GJ79].

Problem 21 (Largest common subgraph problem) Given graphs G = (V1 , E1 )


and H = (V2 , E2 ), positive integer K. Do there exist subsets E1′ ⊆ E1 and
E2′ ⊆ E2 with |E1′ | = |E2′ | ≥ K such that the two subgraphs G′ = (V1 , E1′ ) and
H ′ = (V2 , E2′ ) are isomorphic?

Problem 22 (Maximum subgraph matching problem) Given directed graphs


G = (V1 , E1 ) and H = (V2 , E2 ), positive integer K. Is there a subset R ⊆
V1 × V2 with |R| ≥ K such that for all < u, u′ >, < v, v ′ > ∈ R, (u, v) ∈ E1 if
and only if (u′ , v ′ ) ∈ E2 ?

All the three problems are NP-complete: each can be transformed from the
problem of finding maximal cliques 10 in a graph. The problem addressed in
this chapter is similar to the latter two problems. However our interest has
been in finding at least K isomorphs and all possible such isomorphs.

12.7 Applications
Understanding large volumes of data is a key problem in a large number
of areas in bioinformatics, and also other areas such as the world wide web.
Some of the data in these areas cannot be represented as linear strings, which
have been studied extensively with a repertoire of sophisticated and efficient
algorithms. The inherent structure in these data sets is best represented as
graphs. This is particularly important in bioinformatics or chemistry since it

10 The clique problem is a graph-theoretical NP-complete problem. Recall that a clique in


a graph is an induced subgraph which is a complete graph. Then, the clique problem is the
problem of determining whether a graph contains a clique of at least a given size k. The
corresponding optimization problem, the maximum clique problem, is to find the largest
clique in a graph.

might lead to the understanding of biological systems from indirect evidence


in the data. Thus automated discovery of a ‘phenomenon’ is a promising path
to take as is evidenced by the use of motif (substring) discovery in DNA and
protein sequences.
Here we give a brief survey of the current use of this discovery problem
to answer different biological questions. A protein network is a graph that
encodes primarily protein-protein interactions and this is important in un-
derstanding the computations that happen within a cell [HETC00, SBH+ 01,
MBV05]. A recurring topology, or motif, in such a setting has been interpreted
to act as a robust filter in the transcriptional network of Escherichia
coli [MSOI+ 02, SOMMA02].
proteins in distinct topological motifs correlates with the interconnectedness
and function of that motif and also depends on the structure of the topology
of all the interactions. This indicates that motifs may represent evolutionary
conserved topological units of cellular networks in accordance with the spe-
cific biological functions they perform [WOB03, LMF03]. This observation is
strikingly similar to the hypothesis in dealing with DNA and protein primary
structures.
To study complex relationships involving multiple biological interaction
types, an integrated Saccharomyces cerevisiae network in which nodes rep-
resent genes (or their protein products) and the edges represented different
biological interaction types was assembled [ZKW+ 05]. The authors examined
interconnection patterns over three to four nodes and concluded that most
of the motifs form classes of higher-order recurring interconnection patterns
that encompass multiple occurrences of topological motifs.
Topological motifs are also being studied in the context of structural units in
RNA [GPS03] and for structural multiple alignments of proteins [DBNW03].
For yet another application consider a typical chemical data set [CMHK05]: a
chemical is modeled as a graph with attributes on the vertices and the edges.
A vertex represents an atom and the attribute encodes the atom type; an
edge models the bond between the atoms it connects and its attribute en-
codes the bond type. In such a database, very frequent common topologies
could suggest the relationship to the characteristic of the database. For in-
stance, in a toxicology related database, the common topologies may indicate
carcinogenicity or any other toxicity.
In machine learning, methods have been proposed to search for subgraph
patterns which are considered characteristic and appear frequently: this uses
an a priori-based algorithm with generalizations from association discovery
[IWM03].
In massive data mining (where the data is extremely large, of the order of
tens of gigabytes) that include the world wide web, internet traffic and tele-
phone call details, the common topologies are used to discover social networks
and web communities, among other characteristics [Mur03]. In biological data
the size of the database is not as large, yet it is still unsuitable for enumeration schemes.
When such enumeration schemes were applied, researchers had to restrict their motifs to small
sizes such as three or four vertices [MSOI+ 02].

12.8 Conclusion
The potential of an effective automated topological motif discovery is enor-
mous. This chapter presents a systematic way to discover these motifs using
compact lists. Most of the proofs of the lemmas and theorems presented in
this chapter are straightforward. The only tool they use is ‘proof by contra-
diction’, hence they have been left as exercises for the reader.
One of the burning questions is to compute the statistical significance of
these motifs. Can compact motifs/lists provide an effective and acceptable
method to compute the significance of these complex motifs? We leave the
reader with this tantalizing thought.

12.9 Exercises
Exercise 137 (Combinatorics in graphs) Consider the graph of Figure 12.4
with quorum = 2. Notice that every maximal motif occurs in the component
labeled 7. Thus the number of distinct collections of components is

2^6 − 1,

ignoring the empty set. Since there are four vertices with distinct attributes
in each component, the number of location lists of vertices can be estimated as

4(2^6 − 1) = 252.

However, this number is 212 as shown in Figure 12.8. How is the discrepancy
explained?

Hint: Notice that the motifs are (1) connected, (2) maximal and (3) satisfy
a given quorum. Thus the numbers are not captured by pure combinatorics,
although they are fairly close. The number of distinct motifs can be counted
by enumerating i edges, 1 ≤ i ≤ 6, out of 6 possible edges in a motif. Recall
that quorum is 2.

    \binom{6}{5} = 6              (12.5)
    \binom{6}{4} = 15             (12.6)
    \binom{6}{3} = 16 + 4         (12.7)
    \binom{6}{2} − 3 = 12         (12.8)
    \binom{6}{1} = 6              (12.9)
    4\binom{6}{0} = 4             (12.10)

The discrepancies are explained as follows:

• (Equation (12.7)): When we choose a motif that has 3 edges, out of a


possible 6, then the motif could have 4 vertices or 3 vertices (see row 4
of Figure 12.17), hence the split shown in Equation (12.7) above of 20
as 16 + 4.

• (Equation (12.8)): Any two chosen edges may not be adjacent, leading to
disconnected structures. There are 3 such configurations, shown in Fig-
ure 12.6.

• (Equation 12.10): Since each connected component of the graph has four
distinct colored vertices, Equation (12.10) shows a count of 4 × 1 = 4.

Exercise 138 (Graph isomorphisms)

1. Given graphs G1 , G2 , G3 , show that if G1 is isomorphic to G2 and G2


is isomorphic to G3 , then G1 is isomorphic to G3 .

2. Identify the bijections (f ’s) that demonstrate that components (1), (4)
and (5) are isomorphic in Figure 12.10.

Exercise 139 (Motif occurrence) Consider the input graph in (a) below
with 10 vertices and two distinct attributes. The attribute of a vertex is de-
noted by its shape in this figure: ‘square’ and ‘circle’. Let quorum K = 2. A
motif is shown in (b). In (1)-(10) the occurrences of the motif on the graph
are shown in bold.
[Figure: (a) the input graph on vertices v1–v10; (b) the motif on vertices
u1–u4; (1)–(10) the ten occurrences of the motif, shown in bold on copies of
the input graph.]

1. For each occurrence i: (a) What is Oi ? (b) Define the map Fi (of
Definition (12.1)).

2. Is this motif maximal? Why?

3. Obtain the compact notation of the motif and the location lists of the
compact vertices.

Exercise 140 (Motif occurrence) Consider the input graph of Problem (139)
and let quorum K = 2. (a) below shows a motif with six vertices. The occur-
rences (in bold) of the motif are shown in (1)-(5). The dashed-bold indicates
multiple occurrences shown in the same figure: the occurrence of the motif is
to be interpreted as all the solid vertices and edges and one of the dashed edges
and the connected dashed/solid vertex.
[Figure: (a) the motif on vertices u1–u6; (1)–(5) its occurrences, shown in
bold (dashed-bold marking the alternative edges), on copies of the input graph
of Problem (139).]

1. How many distinct occurrences does the motif have?

2. Is this motif maximal? Why?

3. Obtain the compact notation of this motif and the location lists of the
compact vertices.

Hint: Note that the motif edge (u1 u5 ) must also be represented in the com-
pact motif notation giving at least two compact vertices U1 = {u5 } and
U2 = {u5 , u6 } although U1 ⊂ U2 .

Exercise 141 (Undirected, labeled graph) Let G(V, E) be an undirected graph with


attributes on both vertices and edges. Devise a scheme to construct an undi-
rected graph
G′ (V ′ , E ′ )
with attributes only on the vertices so that a maximal motif occurring in G
can be constructed from a maximal motif occurring in G′ (V ′ , E ′ ).

1. What is the size of V ′ in terms of V ?

2. What is the size of E ′ in terms of E?

Hint: Fill in the details for a scheme outlined below.

1. (Annotate G): Introduce suffixes to common vertex and edge attributes.


[Figure: the input graph G; step 1, annotate G (vertex attributes A1, A2, B, C
and edge attributes x1, x2, y, z1, z2); step 2, generate V′ (nodes A1z1, A1x1,
Bz1, Cx1, By, Cy, Bz2, Cx2, A2z2, A2x2).]

2. (Generate V ′ ): For each vertex with attribute Aj and incident edge with
attribute xi , create a node with attribute Aj xi in G′ .

[Figure: step 3, generate E′ on the nodes of V′; step 4, node labels in G′
(P, Q, R, S, T, U); step 5, the input graph G with a maximal motif marked
(a maximal motif in G′ and in G).]

3. (Generate E′): For each pair of nodes whose labels share the same edge-
attribute suffix yk (labels of the form ·yk), introduce an undirected edge.
Similarly, for each pair of nodes whose labels share the same annotated
vertex attribute Ai (labels of the form Ai·), introduce an undirected edge.

4. (Node labels in G′ ): Two node labels of the form Aj1 yk1 and Aj2 yk2 are
deemed to have the same node attribute in G′ .

Exercise 142 (Directed, labeled graph) Let G(V, E) be a directed graph


with attributes on both vertices and edges. Devise a scheme to construct an
undirected graph
G′ (V ′ , E ′ )
with attributes only on the vertices so that a maximal motif occurring in G
can be constructed from a maximal motif occurring in G′ (V ′ , E ′ ).

1. What is the size of V ′ in terms of V ?

2. What is the size of E ′ in terms of E?

Hint: Fill in the details for a scheme outlined below.


1. (Annotate G): Introduce suffixes to common vertex and edge attributes.

[Figure: the directed input graph G; step 1, annotate G; step 2, generate V′
(nodes z1A1x1, yBz1, yBz2, x1Cy, x2Cy, z2A2x2).]

2. (Generate V ′ ): For each incoming edge with attribute xi , vertex with


attribute Aj and outgoing edge with attribute yk , create a node with
attribute xi Aj yk in G′ .

[Figure: step 3, generate E′; step 4, node labels in G′ (P, Q, R); step 5, the
input graph G with a maximal motif marked (a maximal motif in G′ and in G).]

3. (Generate E ′ ): For each pair of nodes with labels ··yk and yk ··, introduce
an undirected edge.
4. (Node labels in G′ ): Two node labels of the form xi1 Aj1 yk1 and xi2 Aj2 yk2
are deemed to have the same node attribute in G′ .

Exercise 143 Consider the initialization step discussed in Section 12.5.1.1


to compute Linit and recall property 2 as:
2. If M(VM , EM ) is any maximal motif on the input graph and U ⊂ VM
is a compact vertex then

LU ⊂c L, for some L ∈ Linit .

Show that the same property holds even when motif M (VM , EM ) is not max-
imal.

Exercise 144 Using the definition of the intersection of two compact lists,
define the intersection of p compact lists.
Exercise 145 Using the definitions of the set operations on compact lists,
show that given compact lists L1 , L2 and L3 , if

L3 ⊂c L1 , L2 ,

then
L3 ⊂c (L1 ∩c L2 ).

Exercise 146 Given a compact list L, do the following statements hold?

1. L ∪c L =c L.

2. L ∩c L =c L.

3. L \c L =c ∅.

Exercise 147 Given two compact lists L1 and L2 , prove the following state-
ments.

1. flat(L1 ∩c L2 ) = flat(L1 ) ∩ flat(L2 ).

2. flat(L1 ∪c L2 ) = flat(L1 ) ∪ flat(L2 ).

3. flat(L1 \c L2 ) = flat(L1 ) \ flat(L2 ).

Hint: Use the definitions of the set operations.

Exercise 148 (a) Prove Theorem (12.2).

(b) Can you relax the definition of intersection ∩c , to ∩new in such a manner
that an intersection is not necessarily maximal, i.e., for maximal sets
L1 and L2 , L1 ∩new L2 may not be maximal.

Hint: (a) (1-4) Use proof by contradiction. (5) Construct an example.


(b) What aspect of the definition lends maximality to the intersection set?

Exercise 149 Let L′1 be a conjugate of L1 and L′2 be a conjugate of L2 .


Then prove that
if L2 ⊂c L1 , then L′2 ⊂c L′1 .

Exercise 150 Construct an example to show that the enrich operation (Sec-
tion 12.4.8) is essential to obtain all the maximal topological motifs.
Hint: Consider the following input graph with three connected components.

The attribute of a vertex is represented by the shape of the vertex (square or


circle). Let quorum K = 2. How many maximal motifs occur on this graph?

Exercise 151 Let L0 ∈ Linit with att(L0 ) = att(L) and discSz(L0 ) > 1.
Then, show that
clique(L) = L ∩c L0 .

Hint: Given a graph G(V, E) and an attribute a, let

Va = {v ∈ V | att(v) = a}.

Further, let the input graph be such that the induced subgraph on Va has k
cliques with the following sizes of the cliques:

1 < d1 ≤ d2 ≤ . . . ≤ dk .

Does there exist L0 ∈ Linit with discSz(L0 ) > 1? Why? If yes, then deter-
mine the following characteristics of L0 : f lat(L0 ), discSz(L0 ), att(L0 ) and
clq(L0 ).

Exercise 152 We give a new definition of topological motif, by modifying


Definition (12.1) as follows: the mapping Hi is not mandated to be bijective
but just a total function or total mapping. Recall that bijective mapping
implies that every vertex u in the motif is mapped to a unique vertex v in
the occurrence O. A total mapping implies that multiple vertices in the motif
may be mapped to the same vertex in O.
What are the implications of this new motif definition? How does the dis-
covery algorithm change?

Hint: See Figure 12.13. Is the structure in (d) a topological motif in the input
graph in (a)? Are the motifs in (b) and (c) maximal by the new definition?

Exercise 153 (MCCS) Given a meta-graph, devise a method to extract all


the MCCSs.
1. What is the time complexity of the method?

2. Comment on the following approach to the problem:

(a) Obtain the largest subgraph G′ with no forbidden edges.


(b) Check for connectivity in G′ .

Hint: Given a graph, an independent set is a subset of its vertices that are
pairwise not adjacent. In other words, the subgraph induced by these vertices
has no edges, only isolated vertices. Then, the independent set problem is as
follows: Given a graph G and an integer k, does G have an independent set
of size at least k ? The corresponding optimization problem is the maximum
independent set problem, which attempts to find the largest independent set
in a graph. The independent set problem is known to be NP-complete.
The connectedness of a graph, on the other hand, can be computed in linear
time using a BFS or a DFS traversal.

Exercise 154 (Flat list) A compact list L where each L ∈ L is a singleton


set is termed a flat list.

1. Show that if L′ ⊂c L and L is a flat list, then so is L′ .

2. Consider the graph with seven connected components in Figure 12.4. We


follow a convenient notation to denote a vertex of this given graph G
as follows. Since each connected component has exactly one vertex of a
fixed color, a vertex denoted as cX uniquely identifies a vertex with color
X in component numbered c. We follow the convention:

X ∈ {r, b, g, d}, where r denotes red, b denotes blue, g denotes green, and
d denotes black (dark).
d denotes black (dark).

Thus, v = 4r denotes the red vertex in the component numbered 4.


Hence, although slightly unusual, we adopt this notation for this input
graph.
In the following example, identify Linit . What can be said about each
L ∈ Linit ? Retrace the steps in the discovery algorithm and also re-
generate the maximal motifs from the exhaustive list of compact vertices
shown here.
[Table: the modified input incidence matrix for the graph of Figure 12.4,
giving Linit as one block for each attribute r, b, g and d (rows indexed by the
vertices 1r, . . . , 7r, 1b, . . . , 7b, 1g, . . . , 7g, 1d, . . . , 7d), followed by the
successive Refine and Conjugate steps applied to the resulting compact lists;
the individual matrix entries are not reproduced here.]
Exercise 155 Consider the following two examples where the vertices in the
graphs have only one attribute. Assume quorum K = 2. Compute all the
maximal motifs in their compact form.

[Figure: two input graphs, Example 1 and Example 2, each on vertices v1–v5,
given by the adjacency lists below.]

    Adjacency matrix B1          Adjacency matrix B2
    v1: {v2, v3}                 v1: {v2, v5}
    v2: {v1}                     v2: {v1, v3}
    v3: {v1, v4, v5}             v3: {v2, v4}
    v4: {v3}                     v4: {v3, v5}
    v5: {v3}                     v5: {v1, v4}

    (a) Example 1.               (b) Example 2.


Hint: Linit for Example 1.

    L11 = {{v1}, {v3}, {v2, v3}, {v1, v4, v5}}
    L11 = {{v2, v3}, {v1, v4, v5}, {v1}, {v3}}   (entries correspond position by position to the line above)
    L12 = {{v1, v2}, {v1, v3}, {v3, v5}, {v3, v4}}

Linit for Example 2.

    L21 = {{v2, v4}, {v1, v3}, {v2, v5}, {v1, v5}, {v3, v4}}
    L22 = {{v1}, {v2}, {v3}, {v4}, {v5}}         (entries correspond position by position to the line above)
Do the following (motifs) satisfy the quorum constraint? Are they maximal?
[Figure: three candidate motifs, on vertex sets {u1, u2, u3, u4}, {u1, u2, u3}
and {u1, u2} respectively.]

Exercise 156 ∗∗ Let G be a graph and let M be the collection of all topological
motifs on G that satisfy a quorum K. Design an algorithm to update M when
a new vertex v and all its incident edges are added to G.
Hint: This is also known as an incremental discovery algorithm.
Comments
The reader might find this the most challenging chapter in the book. But
he/she can seek solace in the fact that even experts have had difficulty with
this material. However, the ideas presented in this chapter are simple, though
perhaps not obvious.
It is amazing to what lengths scientists are willing to inconvenience them-
selves to gain only a sliver of understanding of the nature of biology: brute-
force enumeration of topological motifs has been used, albeit for motifs with
very few vertices.
The natural question that might arise in a curious reader's mind is: Why not
use multiple BFS (breadth first search) traversals to detect the recurring mo-
tifs? In fact, a survey of the literature reveals that such approaches have been
embraced. However, the problem of explosion due to occurrence-isomorphism
will cripple such a system in data sets with rampant common attributes. Of
course, it is quite possible to generate instances of input that would be debil-
itating even to the approach presented here.
Chapter 13
Set-Theoretic Algorithmic Tools

The devil is in the details,


and so is science.
- anonymous

13.1 Introduction

Time and again, one comes across the need for an efficient solution to a
task that leaves one with an uncomfortable sense of déjà vu. To alleviate such
discomfort, to a certain extent, most compilers provide libraries of routines
that perform oft-used tasks. Most readers may be familiar with string or
input-output libraries. Taking this idea further, packages such as Maxima, a
symbolic mathematics system, R, a language and environment for statistical
computing,1 and other such tools provide invaluable support and means for
solving difficult problems.
In the same spirit, what are the nontrivial tasks, requiring particular atten-
tion, in the broad area of pattern discovery that one encounters over and over
again?
This chapter discusses sets, their interesting structures (such as orders,
partial orders) and efficient algorithms for simple, although nontrivial, tasks
(such as intersections, unions). This book treats lists as sets. Thus a location
list, which is usually a sorted list of integers (or tuples), is treated as a set for
practical purposes.

1 More about Maxima: http://maxima.sourceforge.net/
More about R: http://www.r-project.org/.


13.2 Some Basic Properties of Finite Sets


Let the sets be defined on some finite alphabet

Σ = {σ1 , σ2 , . . . , σL },

where
L = |Σ|.
Let S1 and S2 be two nonempty sets. Then only one of the following three
holds:
1. S1 and S2 are disjoint if and only if

S1 ∩ S2 = ∅.

For example,

S1 = {a, b, c},
S2 = {d, e},

are disjoint since


{a, b, c} ∩ {d, e} = ∅.

2. Without loss of generality, S1 is contained or nested in S2 if and only if

S1 ⊆ S2 .

For example,
S1 = {a, b}
is contained in
S2 = {a, b, d, e},
since
{a, b} ⊂ {a, b, d, e}.

3. S1 and S2 straddle if and only if the two set differences are nonempty,
i.e.,

S1 \ S2 ≠ ∅ and
S2 \ S1 ≠ ∅.

For example,

S1 = {a, b, c, e}, and


S2 = {a, b, d}
straddle since
{a, b, c, e} \ {a, b, d} = {c, e}, and
{a, b, d} \ {a, b, c, e} = {d}.
In other words, for some x, y ∈ Σ,
x ∈ S1 \ S2 and
y ∈ S2 \ S1 .

Two sets S1 and S2 are said to overlap if


1. without loss of generality, S1 is contained in S2 , or
2. S1 and S2 straddle.
Given a collection of sets S, the following collection of two-tuples are termed
the containment information or C:
C(S) = {(S1 , S2 ) | S1 ⊂ S2 and S1 , S2 ∈ S}. (13.1)
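These three cases, and the containment information C(S) of Equation (13.1), translate directly into code. The short Python sketch below uses ordinary built-in sets; the helper names relation and containment_info are ours, chosen only for illustration.

    def relation(S1, S2):
        # Classify two nonempty finite sets as 'disjoint', 'nested' or 'straddle'.
        if not S1 & S2:
            return 'disjoint'
        if S1 <= S2 or S2 <= S1:
            return 'nested'
        return 'straddle'

    def containment_info(S):
        # C(S) of Equation (13.1): ordered pairs (S1, S2) with S1 a proper subset of S2.
        return {(A, B) for A in S for B in S if A < B}    # '<' is proper subset

    print(relation({'a', 'b', 'c'}, {'d', 'e'}))               # disjoint
    print(relation({'a', 'b'}, {'a', 'b', 'd', 'e'}))          # nested
    print(relation({'a', 'b', 'c', 'e'}, {'a', 'b', 'd'}))     # straddle
    S = [frozenset('ab'), frozenset('abde'), frozenset('abce')]
    print(containment_info(S))    # {a,b} is a proper subset of both larger sets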

13.3 Partial Order Graph G(S, E) of Sets


For a collection of sets S, define a directed graph, G(S, E), where, with a
slight abuse of notation, each vertex in the graph is an element of S. The
edge set E is defined as follows. If

S1 ⊊ S2

holds, then there is a directed edge from S2 to S1, written as

S2 → S1 ∈ E.

The graph G(S, E) is called S's partial order (graph). Let

S2 → S1 ∈ E.

Then the following terminology is used.
1. S1 is called the child of S2 .
2. S2 is called the parent of S1 .
3. If two nodes S1 and S3 have a common parent S2 , i.e.,

S2 → S1, S2 → S3 ∈ E,
then S1 and S3 are called siblings.
4. If there is a directed path from S2 to S1 then

(a) S1 is a descendent of S2 and


(b) S2 is an ascendant of S1 .

LEMMA 13.1
(The descendent lemma) If S1 is a descendent of S2 , then

S1 ⊊ S2 .

Let (see Equation (13.1))

C(S) = C(G(S, E)).

Next, we ask the question:

Can some edges be removed without losing the subset information on


any pair of sets?

In other words, is there some


E ′ ⊂ E,
such that
C(G(S, E)) = C(G(S, E ′ )).

13.3.1 Reduced partial order graph


Consider the partial order graph

G(S, E).

Define a set of edges Er (⊆ E) as follows:

    S2 → S1 ∈ Er  ⇔  S2 → S1 ∈ E and there is no S′ ∈ S such that
                     S′ is a descendant of S2 and S1 is a descendant of S′.
The transitive reduction2 of G(S, E) is written as

G(S, Er ).

For convenience, in the rest of the chapter this is also called the reduced partial
order (graph) of S. Figure 13.1 gives a concrete example.

2A binary relation R is transitive if (A, B), (B, C) ∈ R =⇒ (A, C) ∈ R.


[Figure: (a) the partial order G(S, E); (b) the reduced partial order G(S, Er),
both on the nodes {a,b,c,d}, {a,b,c}, {a,b} and {b,c}.]

FIGURE 13.1: Here S = {{a, b, c, d}, {a, b, c}, {a, b}, {b, c}}. Note that
some edges are missing in (b), yet C(G(S, E)) = C(G(S, Er )).

Note that the reduced partial order encodes exactly the same subset infor-
mation as the partial order but with possibly fewer edges, i.e.,

Er ⊆ E.

In fact, a stronger result holds.

LEMMA 13.2
(Unique transitive reduction lemma) Er , the smallest set of edges sat-
isfying
C(G(S, E)) = C(G(S, Er )),
is unique.

The proof is left as Exercise 157 for the reader.
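For a small collection S, the reduced partial order Er can be computed directly from its definition: keep an edge S2 → S1 only if no set of S lies strictly between S1 and S2. The sketch below is our own illustration (helper names are ours, and it is cubic in |S|, fine for small examples); it reproduces the edges of Figure 13.1(b).

    def partial_order_edges(S):
        # All edges S2 -> S1 of G(S, E): S1 is a proper subset of S2.
        return {(S2, S1) for S2 in S for S1 in S if S1 < S2}

    def reduced_partial_order_edges(S):
        # Er: drop S2 -> S1 whenever some set of S lies strictly between S1 and S2.
        return {(S2, S1) for (S2, S1) in partial_order_edges(S)
                if not any(S1 < Sp < S2 for Sp in S)}

    S = [frozenset('abcd'), frozenset('abc'), frozenset('ab'), frozenset('bc')]
    for parent, child in reduced_partial_order_edges(S):
        print(sorted(parent), '->', sorted(child))
    # abcd -> abc, abc -> ab, abc -> bc (in some order), as in Figure 13.1(b)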

13.3.2 Straddle graph


We next address the problem of partitioning S such that any pair of sets in
each partition straddles.
The sets (nodes) in S that straddle can be partitioned using the reduced
partial order graph
G(S, Er).
Each partition is a connected graph, induced by any set (node), say S, in
the partition. This is best defined by the following iterative process, which
constructs one connected graph corresponding to one of the partitions of S.
Algorithm 13 Straddle Graph Construction

    Initialize VS = {S} and ES = ∅.
    REPEAT
        IF S′ ∈ S \ VS and S ∈ VS have a common child, THEN
            Add S′ to VS
            Add edge SS′ to ES
    UNTIL no new node can be added to VS

We call this graph the straddle graph induced by S, written as,

Gstraddle (VS , ES ).
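A literal Python rendering of Algorithm 13 is sketched below; it grows VS from the seed by repeatedly adding any set that shares a child, in the reduced partial order, with a set already in VS. The function and variable names are ours, and the children map is assumed to have been computed from G(S, Er).

    def straddle_graph(seed, S, children):
        # Algorithm 13 as a sketch: grow VS from `seed` by adding any set that
        # shares a child (in the reduced partial order) with a set already in VS.
        # `children` maps every set of S to the set of its children.
        VS, ES = {seed}, set()
        changed = True
        while changed:                                 # REPEAT ... UNTIL no new node
            changed = False
            for Sp in S:
                if Sp in VS:
                    continue
                for X in list(VS):
                    if children[Sp] & children[X]:     # common child
                        VS.add(Sp)
                        ES.add(frozenset((X, Sp)))
                        changed = True
                        break
        return VS, ES

    # Toy run: {a,b}, {b,c} and {c,d} end up in one straddle graph (cf. Figure 13.2(c)).
    A, B, C = frozenset('ab'), frozenset('bc'), frozenset('cd')
    kids = {A: {frozenset('a'), frozenset('b')},
            B: {frozenset('b'), frozenset('c')},
            C: {frozenset('c'), frozenset('d')}}
    VS, ES = straddle_graph(A, [A, B, C], kids)
    print([sorted(v) for v in VS])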

Thus the straddle graph is defined on some subset of S that straddle. Fig-
ure 13.2 shows a concrete example. Let

S = {{a}, {b}, {c}, {d}, {e},


{a, b}, {b, c}, {c, d}, {c, e},
{a, b, c},
{a, b, c, d}, {a, b, c, e},
{a, b, c, d, e}}.

The reduced partial order graph, G(S, Er ) is shown in (a); (b) shows the edges
that connect nodes with common children (these edges are shown as dashed
edges), and (c) shows the two nonsingleton connected straddle graphs. The
singleton straddle graphs (graphs whose node set V has only one element)
are:

G({{a, b, c, d, e}}, ∅),


G({{a, b, c}}, ∅),
G({{a}}, ∅), G({{b}}, ∅), G({{c}}, ∅), G({{d}}, ∅), G({{e}}, ∅).

LEMMA 13.3
(Unique straddle graph lemma) If

S, S ′ ∈ VS ,

then
Gstraddle (VS , ES ) = Gstraddle (VS ′ , ES ′ ).

This can be verified and we leave that as Exercise 160 for the reader.
[Figure: (a) the reduced partial order on the sets listed above; (b) the same
graph with dashed edges connecting nodes that share a common child; (c) the
two nonsingleton connected straddle graphs, one on {a,b}, {b,c}, {c,d}, {c,e}
and one on {a,b,c,d}, {a,b,c,e}.]

FIGURE 13.2: A reduced partial order graph with dashed edges between
nodes that have common children.

13.4 Boolean Closure of Sets


Let S be a collection of sets. Then the boolean closure of sets, B(S), is
defined in terms of its intersection and union closures.

13.4.1 Intersection closure

The intersection closure B∩(S) is defined as follows:

    B∩(S) = { S1 ∩ S2 ∩ . . . ∩ Sl | Si ∈ S, 1 ≤ i ≤ l, for some l ≥ 1 }.

In other words, this is the collection of all possible intersection of the sets.
For example, let

S = {{a, b, c, d}, {a, b, c, e}, {b, c, f }, {e, f }}.


Then,
B∩ (S) = { {e}, ({a, b, c, e} ∩ {e, f })
{f }, ({b, c, f } ∩ {e, f })
{b, c}, ({a, b, c, d} ∩ {a, b, c, e} ∩ {b, c, f }, or
{a, b, c, d} ∩ {b, c, f }, or
{a, b, c, e} ∩ {b, c, f })
{a, b, c}, ({a, b, c, d} ∩ {a, b, c, e})
{e, f }, (in S)
{b, c, f }, (in S)
{a, b, c, d}, (in S)
{a, b, c, e} }. (in S)

13.4.2 Union closure


The union closure B∪(S) is defined as follows:

    B∪(S) = { S1 ∪ S2 ∪ . . . ∪ Sl | for some l ≥ 1 and, for each Sj,
              there exists some Sk (k ≠ j) such that Sj ∩ Sk ≠ ∅ }.

Note that we define the union only over a collection of overlapping sets. For
example, when

S = {{a, b, c, d}, {a, b, c, e}, {b, c, f }, {e, f }, {g, h}},

then,
B∪ (S) = { {e, f }, (in S)
{g, h}, (in S)
{b, c, f }, (in S)
{a, b, c, d}, (in S)
{a, b, c, e}, (in S)
{b, c, e, f }, ({b, c, f } ∪ {e, f })
{a, b, c, d, e}, ({a, b, c, d} ∪ {a, b, c, e})
{a, b, c, d, f }, ({a, b, c, d} ∪ {b, c, f })
{a, b, c, e, f }, ({a, b, c, e} ∪ {b, c, f } ∪ {e, f })
{a, b, c, d, e, f } }. ({a, b, c, d} ∪ {a, b, c, e} ∪ {b, c, f })

Can the following be a member of B∪ (S),

{a, b, c, d} ∪ {g, h} ?

The answer is no, since the two sets have no overlap. Note that the set {g, h}
does not overlap with any of the other sets in S.
Then, how about the following:

{a, b, c, d, e, f } = {a, b, c, d} ∪ {e, f }. (13.2)


But
{a, b, c, d} ∩ {e, f } = ∅,
i.e., the two have no overlap, so this union also cannot be considered for
membership in B∪ (S). However, consider
{a, b, c, d}, {a, b, c, e}, {b, c, f }.
Here
{a, b, c, d} ∩ {a, b, c, e} ≠ ∅ and
{a, b, c, e} ∩ {b, c, f } ≠ ∅.
Thus the union of the three sets can be considered as an element of the union
closure:
{a, b, c, d, e, f } = {a, b, c, d} ∪ {a, b, c, e} ∪ {b, c, f }. (13.3)
Note that Equations (13.2) and (13.3) give rise to the same set, but only the
second union is acceptable by the definition of our union closure.

Back to boolean closure. The boolean closure is the union of the inter-
section closure, B∩(S), and the union closure, B∪(S):

B(S) = B∩(S) ∪ B∪(S).

Thus, the boolean closure consists of all possible intersections and unions of the
overlapping sets.
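Both closures can be computed as fixed points. The sketch below (our own helper names) closes under pairwise intersections and under unions of overlapping pairs; for chains of mutually overlapping sets, as in the worked examples above, this pairwise reading produces the intended unions.

    def intersection_closure(S):
        # B_cap(S): close S under pairwise intersection (empty intersections dropped).
        closure, changed = set(S), True
        while changed:
            changed = False
            for A in list(closure):
                for B in list(closure):
                    X = A & B
                    if X and X not in closure:
                        closure.add(X)
                        changed = True
        return closure

    def union_closure(S):
        # B_cup(S): close S under unions of *overlapping* pairs only.
        closure, changed = set(S), True
        while changed:
            changed = False
            for A in list(closure):
                for B in list(closure):
                    if A & B and (A | B) not in closure:
                        closure.add(A | B)
                        changed = True
        return closure

    S = [frozenset(x) for x in ('abcd', 'abce', 'bcf', 'ef')]
    for X in sorted(intersection_closure(S), key=len):
        print(sorted(X))
    # prints the eight sets of the B_cap(S) example above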

LEMMA 13.4
(Straddle graph properties lemma) Let S be a collection of sets defined
on alphabet Σ such that no two sets in S straddle.
1. Then the boolean closure, B(S), is the same as S, i.e.,
S = B(S) = B∩ (S) = B∪ (S).

2. Then the reduced partial order graph


G(S, Er ) = G(B(S), Er )
= G(B∩ (S), Er )
= G(B∪ (S), Er )
is acyclic.
3. Then the elements of Σ can be consecutively arranged as a string, say s,
such that for each S ∈ S, its members appear consecutively3 in s.

The proof is left as Exercise 161 for the reader.

3 We later introduce the notation ‘s ∈ F (S)’, to articulate the same condition.


13.5 Consecutive (Linear) Arrangement of Set Members


One of the statements in Lemma (13.4) is about the consecutive arrange-
ment of the alphabet Σ. We study the necessary and sufficient condition for
a consecutive arrangement by considering a classical problem studied in
combinatorics.

Problem 23 (The general consecutive arrangement (GCA) problem)


Given a finite set Σ and a collection S of subsets of Σ, find

F(S)

the collection of permutations s of Σ in which the members of each subset


S ∈ S appear as a consecutive substring of s.
For example, consider

S1 = {{a, b, c}, {a, c, d}, {a, c}}.

Here
Σ = {a, b, c, d}.
It can be verified that

F(S1 ) = {b a c d, d c a b, b c a d, d a c b}.

Note that each set S ∈ S1 is such that its members appear consecutively in
each s ∈ F(S1 ). Next, consider

S2 = {{a, b, c}, {a, c, d}, {b, d}}.

Note that
F(S2 ) = ∅,
i.e., it is not possible to arrange the members of Σ in such a way that each
set S ∈ S appears consecutively.
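For a small alphabet, F(S) can simply be computed by exhaustive search over the permutations of Σ. The sketch below does this; it is factorial-time and meant only as an executable restatement of Problem 23 (the PQ tree discussed next is the efficient route), and it reproduces F(S1) and F(S2) above. All names are ours.

    from itertools import permutations

    def appears_consecutively(S, s):
        # Do the members of set S occupy consecutive positions in the sequence s?
        positions = [i for i, x in enumerate(s) if x in S]
        return max(positions) - min(positions) + 1 == len(S)

    def F(Sigma, collection):
        # Brute-force solution of the GCA problem (Problem 23).
        return [''.join(p) for p in permutations(sorted(Sigma))
                if all(appears_consecutively(S, p) for S in collection)]

    S1 = [{'a', 'b', 'c'}, {'a', 'c', 'd'}, {'a', 'c'}]
    S2 = [{'a', 'b', 'c'}, {'a', 'c', 'd'}, {'b', 'd'}]
    print(F('abcd', S1))    # ['bacd', 'bcad', 'dacb', 'dcab']
    print(F('abcd', S2))    # []  -- no consecutive arrangement exists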
Can a systematic approach be developed to solve the GCA problem? The PQ
tree is a data structure introduced by Booth and Lueker [BL76] to solve the
GCA problem in linear time.

13.5.1 PQ trees
A PQ tree, T , is a directed acyclic graph with the following properties.
1. T has one root (no incoming edges).
2. The leaves (no outgoing edges) of T are labeled bijectively by Σ.
[Figure: (a) a PQ tree T with frontier F(T) = a b c d e f; (b) an equivalent
tree T′ with frontier F(T′) = f e d b a c.]

FIGURE 13.3: Two equivalent PQ trees, T ′ ≡ T and their frontiers.

3. It has two types of internal (not leaf) nodes:

P-node: The children of a P-node occur in no particular order.


This node is denoted by a circle.
Q-node: The children of a Q-node appear in a left to right or right
to left order. This node is denoted by a rectangle.

An example PQ tree is shown in Figure 13.3(a) where

Σ = {a, b, c, d, e, f }.

The root node is a P-node and it has six (|Σ|) leaves mapped bijectively to
the elements of Σ. The frontier of a tree T denoted by

F (T ),

is the permutation of Σ obtained by reading the labels of the leaves from left
to right. Note that this definition is valid for any tree, even when the tree is
not a PQ tree.
The frontiers of the PQ trees are shown in Figure 13.3. Two PQ trees T
and T ′ are equivalent, denoted

T ≡ T ′,

if one can be obtained from the other by applying a sequence of the following
transformation rules:

1. Arbitrarily permute the children of a P-node, and

2. Reverse the children of a Q-node.


Any frontier obtainable from a tree equivalent with T is called consistent


with T , and F(T ) is defined as follows:

F(T ) = {F (T ′ )|T ′ ≡ T }.

In Figure 13.3,

F(T ) = F(T ′ )
= {a b c d e, a b c e d, c b a d e, c b a e d,
d e a b c, d e c b a, e d a b c, e d c b a}.
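A tiny interpreter makes the two transformation rules concrete. In the sketch below a leaf is a plain symbol, a P-node is the pair ('P', children) and a Q-node the pair ('Q', children); this encoding and the toy trees are ours and are unrelated to the efficient PQ-tree machinery of [BL76]. The second example happens to produce exactly the eight frontiers listed above.

    from itertools import permutations

    def frontiers(node):
        # All frontiers F(T') over trees T' equivalent with the given PQ tree.
        if isinstance(node, str):                        # leaf
            return {node}
        kind, children = node
        child_sets = [frontiers(c) for c in children]
        if kind == 'P':                                  # children in any order
            orders = list(permutations(range(len(children))))
        else:                                            # 'Q': left-to-right or reversed
            orders = [tuple(range(len(children))),
                      tuple(reversed(range(len(children))))]
        result = set()
        for order in orders:
            partial = {''}
            for i in order:
                partial = {p + f for p in partial for f in child_sets[i]}
            result |= partial
        return result

    print(sorted(frontiers(('Q', [('P', ['a', 'b']), 'c']))))
    # ['abc', 'bac', 'cab', 'cba']
    print(sorted(frontiers(('Q', [('Q', ['a', 'b', 'c']), ('P', ['d', 'e'])]))))
    # the eight permutations listed above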

In the remainder of the discussion, we omit the directions of the edges in


the PQ tree, since the direction is obvious from the PQ tree diagram.

Back to consecutive arrangement. Recall from the last section:

If no two sets in S straddle, then the reduced partial order graph, G(S, Er),
is acyclic.
The following is straightforward to verify and we leave this as Exercise 163
for the reader.

LEMMA 13.5
(Frontier lemma) For a tree T , whose leaves are labeled bijectively with the
elements of Σ and each internal node represents the collection of the labels of
the leaf nodes reachable from this node, say

S ∈ S,

a consecutive arrangement of Σ that respects S is the frontier F (T ).

Since G(S, Er ) is such a tree, a consecutive arrangement is the frontier

F (G(S, Er )).

Next, consider a pair of straddling sets as follows:

S = {{a, b, c, d, e}, {d, e, f, g, h}}.

As we have seen before, s1 , s2 ∈ F(S), where

s1 = a b c d e f g h,
s2 = h g f e d c b a.

This shows that some straddling sets do allow a consecutive arrangement.


What, then, is the condition on straddling sets?
13.5.2 Straddling sets


We give an exposition on the solution to the GCA problem using boolean
closure of S,
B(S),
and its reduced partial order graph,
G(B(S), Er ).
Note that this is not an algorithm to solve the GCA problem. Actually, it
is possible to translate the exposition into an algorithm, but it would not be the most
efficient one.
If every node in G(B(S), Er ) has at most one parent, then it is a strict
hierarchy or a tree and the frontier
F (G(B(S), Er ))
is the consecutive arrangement. Can some nodes have multiple parents and
yet a consecutive arrangement of the elements still be possible?
A graph G(V, E) is a chain or a total order if all its vertices can be arranged
along a path. In other words, there are exactly two end vertices, with degree
one, and every other vertex has degree two.
The following theorem gives the conditions under which
F(S) 6= ∅,
i.e., a linear consecutive arrangement of elements is possible.

THEOREM 13.1
(Set linear arrangement theorem) Given S, a collection of sets on al-
phabet Σ,
F(S) 6= ∅,
(i.e., there exists a consecutive arrangement of the n members of Σ) if and
only if every straddle graph,
Gstraddle (VS , ES ),
for each S ∈ S, in the reduced partial order graph
G(B(S), Er )
is a chain.

PROOF For Si ∈ S, let the straddle graph be


Gstraddle (VSi , ESi ).
We make the two following observations.
1. For each Si , since


Gstraddle (VSi , ESi )
is a chain, without loss of generality, the chain is given as:
Si1 Si2 . . . Sil ,
for some l. This gives a consecutive arrangement of the elements of the
sets in VSi , written as,
sSi = (S′i0 )-(S′i1 )-(S′i2 )-(S′i3 )- . . . -(S′i(l−2) )-(S′i(l−1) )-(S′il )-(S′i(l+1) )   (13.4)

where

    S′i0 ∪ S′i1 = Si1 ,
    S′i1 ∪ S′i2 ∪ S′i3 = Si2 ,
    S′i2 ∪ S′i3 ∪ S′i4 = Si3 ,
    S′i3 ∪ S′i4 ∪ S′i5 = Si4 ,
    . . .
    S′i(l−2) ∪ S′i(l−1) ∪ S′il = Si(l−1) ,
    S′il ∪ S′i(l+1) = Sil ,

and (S′ij ) is simply a consecutive arrangement of the members of S′ij
in any order. However, each (S′ij ) must follow the left-to-right order
given by the index ordering, i.e., if j < k then (S′ij ) is to the left of
(S′ik ).
The proof that such a consecutive arrangement sSi is possible is left as
Exercise 165 for the reader.
2. We claim that there exists some arrangement of all the n members of
Σ, without loss of generality, as
σ 1 σ 2 . . . σn .
We call this the consensus arrangement, and this arrangement satisfies
the following property. For each
1 ≤ p < q < r ≤ n,
if for some i,
σp ∈ Sij , σq ∈ Sik , and σr ∈ Siℓ ,
then
j < k < ℓ or ℓ < k < j
must hold. In other words, the ordering of the elements in the consensus
arrangement does not violate the ordering suggested by each sSi .
Again, the proof that such a consensus arrangement exists is left as
Exercise 165 for the reader.
FIGURE 13.4: Since the collection of sets is closed under union of strad-
dling sets, if a node has two parents then it must be part of a pyramid structure
as shown.

(a) One pyramid. (b) Stack of 3 pyramids.

FIGURE 13.5: Reduced partial order when the sets can be linearly ar-
ranged.

This concludes the proof.

Figures 13.4 and 13.5(a) show examples of a single pyramid and Figure 13.5(b)
shows possible stacking of three pyramids in a reduced partial order graph.

So what does a Q-node encapsulate? Consider the following collection


of sets:

S=
{ {a, b}, {b, c}, {c, d}, {d, e},
{a, b, c}, {b, c, d}, {c, d, e},
{a, b, c, d}, {b, c, d, e},
{a, b, c, d, e} }.

Does there exist a sequence s of {a, b, c, d, e} in which the members of each


subset S ∈ S appear as a consecutive substring of s? It is easy to see that
there are exactly two such sequences F(S) = {s1 , s2 } where


s1 = a b c d e, and
s2 = e d c b a.

The reduced partial order graph is shown in Figures 13.6(a) and (b). The
exact same sets are denoted by a single Q-node as shown in (c) of the figure.
If the number of leaf nodes in the reduced partial order graph is k, then the
number of nonleaf nodes is given as

O(k^2).

[Figure: (a) the reduced partial order of the sets above, drawn over the
sequence a b c d e; (b) the same for the reversal e d c b a; (c) the single
Q-node of a PQ tree over a, b, c, d, e that encodes the same information.]

FIGURE 13.6: Sets shown at the nodes of a reduced partial order graph of
a sequence (a) and its reversal (b). This entire information can be represented
by a single Q-node of a PQ tree as shown in (c).

Back to the PQ tree. To summarize:


If the reduced partial order graph of the boolean closure of S,
G(B(S), Er),
is such that the graph can be partitioned into
[Figure: (a) the reduced partial order graph of the collection S of
Equations (13.5)-(13.7); (b) the corresponding PQ tree on the leaves
a, b, c, d, e, f, g.]

FIGURE 13.7: If the elements (a, b, c, d, e, f, g) can be organized as a


sequence, then the reduced partial order graph can be encoded as a PQ tree
as shown in the above example.

• ‘pyramid’ structures and


• ‘tree’ structures,
then the elements of Σ can be linearly arranged as some string s.
By partition we mean that the edge set

Er = E1 ∪ E2 ∪ . . . ∪ El ,

where
Ei ∩ Ej = ∅ with 1 ≤ i < j ≤ l,
and each partition is defined on Ei . By ‘tree structure’ we mean that no node
in the (sub)graph has more than one parent.
Thus we conclude:
Given S, if the elements of Σ can be organized as a linear string, then
the reduced partial order graph can be encoded as a PQ Tree.
As the final example, consider

S = { {a, b, c, d}, (13.5)


{e, f }, {f, g}, {e, f, g}, (13.6)
{a, b, c, d, e, f, g}}. (13.7)

The reduced partial order graph of this collection is shown in Figure 13.7(a).
It can be partitioned into one ‘pyramid structure’ (shown enclosed by a dashed
rectangle) and a ‘tree structure’.
The PQ encoding is as follows: the tree has one Q-node, two P-nodes and
seven leaf nodes. The set shown in (13.5) is encoded as a single P-node; the
sets shown in (13.6) are encoded as a single Q-node.
13.6 Maximal Set Intersection Problem (maxSIP)


Consider the following scenario. Let
Σ = {a, b, c, d, e, f, g},
be a finite alphabet with a collection of sets, C, defined on it as follows:
C = {C1 , C2 , C3 , C4 , C5 }, where
C1 = {g, b, c, d, e, a},
C2 = {a, f, c, d},
C3 = {a, b, d, e, c},
C4 = {b, d, e},
C5 = {f, a, b, c, d, e}.
We seek all ‘maximal’ set intersections S of at least K (called the quorum)
elements of C; the set of participating indices is denoted IS . To gain an
informal understanding of the maximality of the pair (S, IS ), consider the
following two cases. Let quorum K = 3.
1. Let S ′ = {b, d},
then index set IS ′ = {1, 3, 4, 5},
with |IS ′ | ≥ 3. However, (S ′ , IS ′ ) is not maximal since there exists
S1 ⊃ S ′
with
S1 = {b, d, e}, and IS1 = IS ′ = {1, 3, 4, 5}.
The pair (S1 , IS1 ) is maximal.
2. Let an index set be I′ = {1, 2, 5}; then the corresponding set is
S2 = {a, c, d}.
Clearly, the pair (S2 , I′ ) is not maximal, since
S2 ⊂ C3 and so the index set must contain 3.
If IS2 = {1, 2, 3, 5}, then the pair (S2 , IS2 ) is maximal.
This demonstrates that both a set S and an index set I must be ‘expanded’
to ensure maximality.
Further, the only other maximal pair for this example is
S3 = {a, b, c, d, e} with IS3 = {1, 3, 5}.
The problem is formally stated as follows.
Problem 24 (Maximal Set Intersection Problem (maxSIP(C,K))) The input


to the problem is a collection of n sets C = {C1 , C2 , . . . , Cn }, where each
Ci ⊂ Σ, |Σ| = m, and a quorum K > 0. For a set S such that

S = C i1 ∩ C i2 ∩ . . . ∩ C ip ,

we denote by IS the set of indices

IS = {i1 , i2 , . . . , ip }.

Further, IS is maximal, i.e., there is no I′ with

IS ⊊ I′,

such that

S = Cj1 ∩ Cj2 ∩ . . . ∩ Cjp′ , where I′ = {j1 , j2 , . . . , jp′ }.

The output is the set of all maximal pairs (S, IS ) such that |IS | ≥ K.

Further, we say S ′ is nonmaximal with respect to S if

S′ ⊊ S, and IS′ = IS .
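As an executable, if naive, restatement of the problem: a pair (S, IS) is maximal precisely when IS collects every index i with S ⊆ Ci and S is the full intersection over IS. The sketch below closes every seed index set in this way; it is exponential in n and is a reference for small inputs only. The quorum is applied here to |IS|, the number of participating sets, consistent with the worked example later in this section; the function name and the toy input are ours.

    from itertools import combinations

    def max_sip_bruteforce(C, K):
        # Reference solution of maxSIP(C, K): every maximal pair is obtained by
        # 'closing' a seed index set -- intersect the chosen sets, then collect all
        # indices of sets containing that intersection.  Exponential in len(C).
        C = [frozenset(c) for c in C]
        n = len(C)
        maximal = {}
        for r in range(K, n + 1):
            for seed in combinations(range(n), r):
                S = frozenset.intersection(*(C[i] for i in seed))
                if not S:                                  # empty intersections are not reported
                    continue
                IS = frozenset(i + 1 for i in range(n) if S <= C[i])   # 1-based, maximal index set
                maximal[IS] = S                            # S equals the intersection over IS
        return maximal

    C = [{'a', 'b', 'd'}, {'a', 'b', 'c'}, {'a', 'b'}, {'c', 'd'}]
    for IS, S in max_sip_bruteforce(C, K=2).items():
        print(sorted(S), '<-', sorted(IS))
    # ['a', 'b'] <- [1, 2, 3];  ['d'] <- [1, 4];  ['c'] <- [2, 4]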

13.6.1 Ordered enumeration trie


Given an input collection of sets, how do we detect all such maximal pairs?
We propose a simple scheme to enumerate the maximal pairs (S, IS ). To
avoid multiple enumerations, we follow some arbitrary (but fixed) order on
the alphabet. Without loss of generality, let

σ 1 < σ 2 < . . . < σm .

Given a set
S = {σi1 < σi2 < . . . < σil },
define a sequence, seq(S), as

seq(S) = σi1 σi2 . . . σil .

Given an input C and a quorum K, consider the following collection of se-
quences:

    SC,K = { S | (S, IS ) is a maximal pair for input C and quorum K }.   (13.8)

Also,

    Sq C,K = {seq(S) | S ∈ SC,K }.   (13.9)
[Figure: (a) the concrete example, with maximal pairs
S1 = {a, b, c, d, e}, IS1 = {1, 3, 5};  S2 = {a, c, d}, IS2 = {1, 2, 3, 5};
S3 = {b, d, e}, IS3 = {1, 3, 4, 5};  and the sequences
S = {seq(S1) = abcde, seq(S2) = acd, seq(S3) = bde};
(b) the trie for S;  (c) the reduced trie.]

FIGURE 13.8: Continuing the concrete example of Section 13.6: The set
of sequences S, following the ordering a < b < c < d < e < f < g and the
corresponding trie in (b). The reduced trie where each internal node has at
least two children is shown in (c).

The trie, TC,K , of the elements of Sq C,K is termed the ordered enumeration
trie or simply the trie of the input C with quorum K. Figure 13.8 displays
the trie for a simple example. What is the size of this trie?

LEMMA 13.6
(Enumeration-trie size lemma) The number of nodes in trie TC,K is no
more than

    \sum_{s ∈ Sq C,K} |s|  =  \sum_{S ∈ SC,K} |S|,

where SC,K , Sq C,K are as defined in Equations (13.8) and (13.9).

13.6.2 Depth first traversal of the trie


Given C and K, we propose an algorithm, which traverses the ordered enu-
meration trie, TC,K , in a depth first order. The pseudocode in Algorithm (14)
implicitly generates this tree and produces the maximal pairs. Before we delve
into the details of the algorithm, we note two of its important characteristics.
[Figure: the overall scheme. Input: C and quorum K; impose an ordering on
the elements of Σ; build the complete binary ordered tree; prune it (the pruned
tree is implicit in Algorithm (14)); collapse it to the search tree, i.e., the trie
of intersection sets; output the maximal pairs (S, IS).]

FIGURE 13.9: Overall scheme of the enumeration algorithm.

1. The running time complexity of the algorithm is output-sensitive i.e.,


the amount of work done is (almost) linear with the size of the output.

2. It is possible to output the maximal sets as the algorithm is executed,


without ever having a need to backtrack. In other words, the enumera-
tion is done in such a way that once a set (S, with its IS ) is ascertained to
be maximal, it remains so till the end of the execution of the algorithm.

How does the algorithm work? This is best understood as a scheme


described in Figure 13.9.
Note that the algorithm does not follow these stages literally; for efficiency reasons
it directly constructs the trie after ordering the elements of Σ. The ‘boxed’
stages aid in proving the correctness of the algorithm and in the analysis of
running time complexity of the algorithm.
Algorithm 14 (Maximal Set Intersection Problem)

maxSIP(C, K, S, IS, j)
{
    IF j > 0 AND |IS| ≥ K
        ISnew ← {i | i ∈ IS AND σj ∈ (Ci ∈ C)}      // takes O(n) time
        IF (|ISnew| ≥ K) {
            Snew ← S ∪ {σj}
            Terminate ← FALSE
            IF (ISold = Exists(T, ISnew))            // takes O(log n) time
                IF |ISnew| = |IS|                    // immediate parent;
                                                     // (Snew, IS) is possibly maximal
                    Replace(T, Sold, Snew)           // takes O(log n) time
                ELSE Terminate ← TRUE                // (Snew, ISnew) is nonmaximal,
                                                     // hence terminate this branch
            ELSE Add(T, Snew)                        // takes O(log n) time
            IF NOT Terminate
                maxSIP(C, K, Snew, ISnew, j-1)       // left-child call
        }

        OUTPUT(S, IS)                                // (S, IS) is certainly maximal

        IF |ISnew| ≠ |IS| OR j = 1                   // right & left subtrees the same
                                                     // OR very last right-child call
            maxSIP(C, K, S, IS, j-1)                 // right-child call
}

Input parameters. The routine in Algorithm (14) takes the following four
parameters:

1. A collection of n sets

C = {C1 , C2 , . . . , Cn },

where each Ci is defined on some alphabet Σ = {σ1 , σ2 , . . . , σm }, say


and 1 ≤ i ≤ n.

2. The quorum K that restricts every output pair (S, IS ): S must be the
intersection of at least K input sets, i.e., |IS | ≥ K.

3. The pair S and IS computed until this point. The algorithm is initiated
with the following settings:

(a) S ← ∅ and IS ← U = {1, 2, 3, . . . , n}.


(b) j = m (recall |Σ| = m).

4. The symbol σj ∈ Σ which is given as simply j in the call.
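The following is a compact Python sketch in the spirit of the recursion above. It keeps the left-child/right-child structure and the quorum pruning, but, to stay short, it verifies maximality at the end (S must equal the intersection over IS) instead of maintaining the balanced tree T, so the output-sensitive bound of the text is not claimed for it. Symbols are processed in increasing order, as drawn in the worked example below; all names are ours.

    def max_sip_dfs(C, K):
        # Depth-first over take/skip decisions for each symbol, pruning when the
        # quorum |IS| >= K fails and when taking a symbol does not shrink IS
        # (the skip-branch is then redundant).
        C = [frozenset(c) for c in C]
        Sigma = sorted(set().union(*C))
        found = {}                                     # leaf index set -> largest S seen

        def recurse(S, IS, j):
            if len(IS) < K:                            # quorum pruning
                return
            if j == len(Sigma):                        # all symbols considered
                if S and len(S) > len(found.get(IS, '')):
                    found[IS] = S
                return
            sigma = Sigma[j]
            IS_new = frozenset(i for i in IS if sigma in C[i])
            if len(IS_new) >= K:
                recurse(S | {sigma}, IS_new, j + 1)    # left child: sigma taken
            if len(IS_new) != len(IS):                 # otherwise right repeats left
                recurse(S, IS, j + 1)                  # right child: sigma ignored

        recurse(frozenset(), frozenset(range(len(C))), 0)
        # keep only maximal pairs: S must be the full intersection over IS
        return {IS: S for IS, S in found.items()
                if S == frozenset.intersection(*(C[i] for i in IS))}

    C = [{'a', 'c'}, {'a', 'd'}, {'a', 'c', 'd'}]
    for IS, S in max_sip_dfs(C, K=2).items():
        print(sorted(S), '<-', sorted(i + 1 for i in IS))
    # ['a', 'c'] <- [1, 3];  ['a', 'd'] <- [2, 3];  ['a'] <- [1, 2, 3]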


The Complete Binary Ordered Tree BC . Recall that, in general

σ 1 < σ 2 < . . . < σm .

A small example is discussed in Figure 13.10. The input sets are shown in (a)
where the alphabet is
Σ = {a < c < d}.
A complete binary ordered tree, BC,K , for this input is shown in (b). The
root of this tree is labeled with the pair (S, IS ) where

S = ∅ and IS = U = {1, 2, . . . , n},

where n = |C|. Every internal node (including the root node) has exactly
two children: the left child is labeled with some σj ∈ Σ, 1 ≤ j ≤ m, and the
right-child is unlabeled and is shown as a dashed edge in the figure.
For a node v at depth of j from the root, define pathv , the path from the
root node down to node v on this complete binary ordered tree, by the labels
on the edges written as

pathv = X1 X2 . . . Xj ,

where, for each 1 ≤ k ≤ j,

    Xk = {σk }   if the k-th edge on the path is a left-child edge (labeled σk ),
    Xk = ∅       if the k-th edge on the path is a right-child edge.

The node is labeled by the pair (Sv , ISv ) which is defined as:

Sv = Σv ∩ Ci1 ∩ Ci2 ∩ . . . ∩ Cip , and
ISv = {i1 , i2 , . . . , ip }.

where
Σv = X 1 ∪ X 2 ∪ . . . ∪ X j .
In other words,

1. the left child corresponds to ‘presence of σj ’, and

2. the right child corresponds to ‘ignoring σj ’ (note that it is not ‘absence


of σj ’).

LEMMA 13.7
Each dashed edge is such that the label, (S, IS ), on both the nodes of this
edge are the same.
C1 = {a, c},
C2 = {a, d},
C3 = {a, c, d}.
(a) The input C = {C1 , C2 , C3 }, with quorum K = 2.

[Figure panel (b): the complete binary ordered enumeration tree for C; its
root is labeled (∅, {1,2,3}) and its leaves are labeled ({a,c,d}, {3}),
({a,c}, {1,3}), ({a,d}, {2,3}), ({a}, {1,2,3}), ({c,d}, {3}), ({c}, {1,3}),
({d}, {2,3}) and (∅, {1,2,3}).]
(b) The complete binary ordered enumeration tree for C with a < c < d.

      S        IS
      {a}      {1, 2, 3}
      {a, c}   {1, 3}
      {a, d}   {2, 3}
(c) The solution (maximal intersection pairs) to the input.
(Note that the pair ({c}, {1, 3}) is not maximal.)

FIGURE 13.10: (a) An input, (b) the corresponding complete binary


ordered enumeration tree, and (c) the solution.
See Figure 13.10(b) for a concrete example. It can be verified that the ordered
enumeration trie, TC,K , is ‘contained’ in this complete binary ordered tree
BC .
We now relate BC to the pseudocode in Algorithm (14). This tree is implic-
itly generated by the algorithm. Note that this is the complete tree, but the
code will prune the tree (as discussed later) for efficiency purposes. In detail,
the algorithm can be understood as follows. The routine makes at most two
recursive calls
1. marked as left-child call, and
2. marked as the right-child call.
Thus each left-child edge of BC is marked with {σj } while the right-child edge
is labeled (implicitly) with an empty set ∅ (or unlabeled in Figure 13.10(b)).
For easy retrieval the pair (Sv , ISv ) at node v is stored in a balanced tree data
structure T (see below).
Pruning BC . Next, we explore how this complete binary ordered tree is
pruned. This is done by identifying a set of traversal terminating conditions.
Every recursive routine must have a mandatory termination condition for
obvious reasons.4 For efficiency purposes, the routine has four additional
terminating conditions. The termination of a recursive call corresponds to
the pruning of the complete binary search tree. All the terminating conditions
are discussed below.

1. The ‘mandatory’ terminating condition when all the alphabet has been
explored (the j > 0 condition).
2. We discuss two terminating conditions here, both arising when the com-
puted ISnew is found to be the same as some existing set (ISold or IS ).
This condition ensures that a nonmaximal set S is detected no more
than once, giving an asymptotic improvement in the overall efficiency
(leading to an output-sensitive time complexity) of the algorithm.
(a) Case ISnew = ISold , i.e., the freshly computed ISnew already exists
in the data structure T .
Further if
Sold ⊃ Snew ,
then clearly Snew is nonmaximal. In this case not only Snew can be
discarded but also, the traversal can be terminated here, since it
is guaranteed that all the subsequent sets detected on this subtree
(of the complete binary ordered tree) will be nonmaximal.
However, if
Sold ⊉ Snew ,

4 Otherwise, the run-time stack will overflow eventually crashing the system.
[Figure: the left subtree (rooted at ({a}, {1,2,3})) and the right subtree
(rooted at (∅, {1,2,3})) of the root of the tree in Figure 13.10(b).]
FIGURE 13.11: Consider the tree of Figure 13.10. The left and the right
subtrees of the root node when the set of indies I is the same on both nodes.
Notice that the index sets I are identical on the nodes in both the subtrees but
the sets S are nonmaximal in the right subtree, i.e., the set S R on the right
subtree is a subset of the corresponding S L on the left subtree (S R ⊂ S L ).

then Snew is possibly 5 maximal. What can we say about Sold ?


The answer lies in the following question: Is it possible that

Sold ⊉ Snew and Snew ⊉ Sold but ISnew = ISold ?

If this is the case then there must exist

S ′ ⊇ Sold ∪ Snew ,

with
IS ′ = ISnew = ISold .
Hence it must be that
Sold ⊂ Snew
and Snew replaces Sold in the data structure T .
We use the variable Terminate, in the pseudocode to check for this
condition.
(b) Case ISnew = IS (at the ‘right-child’ call).
In fact, it is adequate to simply check if the set sizes are the same,
i.e.,
|IS | = |ISnew |,
which can be done in O(1) time.
Why can we safely terminate the search of this branch (‘right-
child’) on the complete binary ordered tree? Since IS = ISnew , the

5 In other words S
new is maximal until this point in the execution. The possibility remains
that sometime later in the execution, it may turn out to be indeed nonmaximal.
[Figure: the pruned search tree, with each terminated branch annotated
(I–V) by the terminating or pruning condition that applies: the quorum
violation |I| < K = 2, the last right-child call (j = 1), IS = ISold, and
identical right and left subtrees.]
FIGURE 13.12: The pruned search tree of Figure 13.10. The order of
symbols is a < c < d and quorum K = 2. The different terminating or
pruning conditions are explained above. See text for more details.

two subtrees corresponding to the left and the right child rooted
at this node are identical. Further, all the sets (S) associated with
the nodes of the right subtree are nonmaximal. See Figure 13.11
for a concrete example of nonmaximal sets in the right subtree.

3. If the set size, |ISnew |, falls below quorum, K, the search on the complete
binary ordered tree can be terminated. This gives rise to efficiency in
practice but no asymptotic improvements can be claimed due to this
condition.

4. The very last (when j = 1) ‘right-child’ call need not be made, since it
does not give rise to any new sets S. Again, this gives rise to efficiency
in practice but no asymptotic improvements can be claimed due to this
condition.

For a complete example see Figure 13.12: it shows the pruned version of the
complete binary tree shown in Figure 13.10, along with the various terminat-
ing conditions.
Collapsing the pruned enumeration tree. The dashed edges of the pruned tree
can be collapsed, so that each node in this collapsed tree has a unique label
given as (S, IS ) and the collapsed tree is shown in Figure 13.13(b).
We call this the search tree. We study the characteristics of this tree:

1. We identify three kinds of vertices in the search tree:

(a) Solid circles: these correspond to the maximal sets.


[Figure: (a) the pruned tree of Figure 13.12; (b) the collapsed pruned tree,
or the search trie, with node labels (∅, {1,2,3}), ({a}, {1,2,3}), ({a,c}, {1,3})
and ({a,d}, {2,3}).]

FIGURE 13.13: The pruned tree of Figure 13.12 can be collapsed to


the search trie shown on the right. The dashed edge has identical labels
({a}, {1, 2, 3}) at both its end-points (see Lemma (13.7)) and represents a
nonexistent call. This dashed edge is removed to produce a collapsed tree
with unique labels at each node. It retains the stub edges to keep track of the
(number of) terminated calls.

(b) Hollow circles: these correspond to the nonmaximal sets.


(c) Little squares: these are the ‘nonexistent’ nodes, since the recur-
sive call is terminated due to different terminating conditions.

2. We identify two kinds of edges in the reduced tree:

(a) Stubs: These are edges that are incident on the little square nodes.
(b) Regular: These are the ones that are not stubs.

The search tree without the stubs is indeed the trie that we discussed earlier
in the section.
The edges in the search tree (both regular and stub) correspond to the
number of recursive calls in Algorithm (14). Using Lemma (13.6), we obtain
the following:

LEMMA 13.8
The number of regular and stub edges in the search tree is no more than
    |Σ| \sum_{S ∈ SC,K} |S|.
[Figure: the maximal pairs S1 = {a}, IS1 = {1, 2, 3};  S2 = {a, c}, IS2 = {1, 3};
S3 = {a, d}, IS3 = {2, 3}.  (a) Under the ordering a < c < d the sequence
collection is S1 = {seq(S1) = a, seq(S2) = ac, seq(S3) = ad} and the
corresponding trie is shown.  (b) Under the ordering c < a < d the collection is
S2 = {seq(S1) = a, seq(S2) = ca, seq(S3) = ad} and a topologically different
trie results.]

FIGURE 13.14: The maximal sets are shown as solid circles and the nonmaximal
sets as hollow circles in the trie (i.e., the search tree without the stubs). Different
orderings lead to different tries, but to the same maximal pairs.

Different ordering of elements of Σ. Figure 13.14 shows the search tree


for two different orderings of the alphabet Σ. Clearly, the tries are topologi-
cally different (i.e., not isomorphic to each other), but the resulting maximal
sets are the same. This is straightforward to verify and we leave that as an
exercise for the reader.

Order of maximal sets. The boxed statement OUTPUT(S, IS ) outputs


all the maximal pairs (S, IS ). It turns out that these sets correspond exactly
to the sets stored in T . In other words, the routine can simply output the
maximal set S (at the boxed statement) with never a need to undo or augment
the set later.
Indeed, this is an optional statement in the routine. This just indicates
that the maximal sets can be ‘output’ as they are generated.

Operations of the data structure T . For efficiency purposes, the algo-


rithm uses a (balanced) tree data structure T to store the sets represented by
(S, IS ), sorted by IS . The different routines on T are as follows. Each routine
takes O(log n) time (recall that |C| = n).

1. IS′ = Exists(T , IS ) checks if IS exists in the data structure and if it does,


returns it as IS ′ .

2. Replace(T ,S,S ′ ), replaces S in the data structure T with S ′ .

3. Add(T ,(S, IS )) adds (S, IS ) as a new node in the data structure T using
IS as the key.
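For concreteness, the three routines above can be mimicked by any dictionary keyed by the index set. The small Python sketch below (our own naming) uses a hash map keyed by a frozen copy of IS in place of the balanced tree; this changes the constants, but not the role the structure plays in the algorithm.

    class PairStore:
        """Stores pairs (S, I_S) keyed by the index set I_S (illustrative sketch;
        a hash map stands in for the balanced tree of the text)."""

        def __init__(self):
            self._by_index_set = {}                  # frozenset(I_S) -> S

        def exists(self, I_S):
            """Return the stored key if I_S is present, else None (cf. Exists)."""
            key = frozenset(I_S)
            return key if key in self._by_index_set else None

        def replace(self, I_S, S_new):
            """Replace the set stored under I_S with S_new (cf. Replace)."""
            self._by_index_set[frozenset(I_S)] = S_new

        def add(self, I_S, S):
            """Add (S, I_S) as a new entry, keyed by I_S (cf. Add)."""
            self._by_index_set[frozenset(I_S)] = S

        def pairs(self):
            return [(S, set(I)) for I, S in self._by_index_set.items()]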

Algorithm time complexity. Let the size of the input NI and the size of
the output NO be given as
    NI = ∑_{C ∈ C} |C|,
    NO = ∑_{S ∈ S_{C,K}} (|S| + |IS|),

using SC,K defined in Equation (13.8). For convenience, let


    NO1 = ∑_{S ∈ S_{C,K}} |S|,
    NO2 = ∑_{S ∈ S_{C,K}} |IS|.

Thus
NO = NO1 + NO2 .

The number of calls is bounded by the total number of edges in the search
tree given in Lemma (13.6) as O(NO1 |Σ|). Each routine call, which corre-
sponds to a node in this tree, takes

1. O(NI ) time to read the input,

2. O(n) time to compute ISnew , and

3. each operation on the (balanced) tree T takes O(log |C|) = O(log n)


time.

Thus the amount of work done on all the nodes of the search tree is

O (NO1 (NI + n + log n)|Σ|) .

However, a more careful counting significantly tightens this bound.


To improve the bound, we use NO2 in a more meaningful manner. We ask
the question: how many times is an index set, IS, encountered (or 'read') in
the course of the execution of the algorithm? It turns out that this number is
O(NS), where NS is the number of nonmaximal sets S′ of S encountered in
the search tree. For any S, the number of nonmaximal S′ with respect to S is
O(|Σ|), as a rough estimate. Also, notice in the pseudocode of Algorithm (14)
that the input is read along with the index set. Thus the overall time
taken for reading all the index sets, along with the input, is

O((NO2 + NI )|Σ|).

Combining this with the time taken for the remainder of the routine, the
overall time complexity of Algorithm (14) is given as:

O((NO2 + NI )|Σ| + NO1 log n|Σ|)


= O((NO2 + NI + NO1 log n)|Σ|)
= O(NI + NO log n)

This time complexity is considered output-sensitive since the amount of work


done is (almost) linear with the size of the output NO . We say ‘almost linear’
since strictly speaking, there is also a logarithmic term, log n, in the formula.
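To make the preceding discussion concrete, the following Python sketch mirrors the recursive enumeration of Algorithm (14); the function and variable names are ours, the scan runs left to right over the ordered alphabet (the direction is immaterial for the output), and the hash map `store` plays the role of T. The input sets below are our guess at the data behind Figures 13.12-13.14.

    def max_set_intersection(C, K):
        """Maximal Set Intersection: return {frozenset I_S: S} for the input
        collection C (1-based indices) and quorum K.  A sketch along the lines
        of Algorithm (14); not the book's exact routine."""
        alphabet = sorted(set().union(*C))
        store = {}                                      # plays the role of T

        def recurse(S, I, j):
            if j >= len(alphabet) or len(I) < K:
                return
            sigma = alphabet[j]
            I_new = frozenset(i for i in I if sigma in C[i - 1])
            same_as_parent = (len(I_new) == len(I))
            if len(I_new) >= K:
                S_new = S | {sigma}
                prune = False
                if I_new in store:
                    if same_as_parent:
                        store[I_new] = S_new            # immediate parent: grow the stored S
                    else:
                        prune = True                    # S_new is nonmaximal; stop this branch
                else:
                    store[I_new] = S_new
                if not prune:
                    recurse(S_new, I_new, j + 1)        # 'left-child' call: include sigma
            if not (len(I_new) >= K and same_as_parent):
                recurse(S, I, j + 1)                    # 'right-child' call: skip sigma

        recurse(set(), frozenset(range(1, len(C) + 1)), 0)
        return store

    C = [{'a', 'c'}, {'a', 'd'}, {'a', 'c', 'd'}]       # consistent with Figure 13.14
    for I_S, S in max_set_intersection(C, 2).items():
        print(sorted(S), sorted(I_S))
    # ['a'] [1, 2, 3]   ['a', 'c'] [1, 3]   ['a', 'd'] [2, 3]

Reordering `alphabet` (say c < a < d) changes the shape of the search tree but, as noted above, not the reported maximal pairs.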

13.7 Minimal Set Intersection Problem (minSIP)


Problem 25 (Minimal Set Intersection Problem (minSIP(C,K))) The input
to the problem is a collection of n sets C = {C1 , C2 , . . . , Cn }, and a quorum
K > 0. For a set S such that

S = C i1 ∩ C i2 ∩ . . . ∩ C ip ,

we denote by IS the set of indices

IS = {i1 , i2 , . . . , ip }.

Further, IS is minimal, i.e., there is no I′ with

    I′ ⊊ IS,

such that

    S = Cj1 ∩ Cj2 ∩ . . . ∩ Cjp′ , where I′ = {j1, j2, . . . , jp′}.

The output is the set of all pairs (S, IS ) such that |IS | ≥ K.

Notice here that it is possible to have distinct minimal sets S1, S2, . . . , Sp such that

    IS1 = IS2 = . . . = ISp.

In other words, multiple sets may be associated with a single index set I.

13.7.1 Algorithm
We design an algorithm along the lines of Algorithm (14). The new algorithm
(Algorithm (15)) is almost the same except for two differences:

1. There is only one terminating condition in the routine here which is when
the size of the set ISnew falls below the quorum K. In Algorithm (14),
there is yet another terminating condition which is when the set Snew
is nonmaximal.
2. Since multiple sets can be associated with one index set I, note that Sold
is a collection of sets. Further we use a new routine on the balanced
binary tree T called Append(T , Sold , Snew ). This routine first removes
any S ∈ Sold from the collection Sold satisfying

S ⊃ Snew .

Then Snew is added to the collection Sold if there is no S ∈ Sold such


that
S ⊂ Snew .

Clearly, the computation of the minimal sets is more expensive because of


the Append(·) routine and the lack of quick termination due to nonmaximality
or similar conditions. We leave the time complexity analysis of this
algorithm as an exercise for the reader. Instead, we describe a scheme where
the minimal sets are computed from the maximal sets.

Algorithm 15 (Minimal Set Intersection Problem)

minSIP(C, K, S, IS , j)
{
IF (j ≤ 0) EXIT
ISnew ← {i | i ∈ IS AND σj ∈ (Ci ∈ C)}
IF (|ISnew | ≥ K) {
Snew ← S ∪ {σj }
IF (ISold = Exists(T , ISnew ))
IF |ISnew | = |IS | //immediate parent, so don't update T
ELSE Append(T , Sold , Snew ) //multiple minimal sets
//if subset of existing set S ′ , remove S ′ from T
ELSE Add(T , Snew )
minSIP(C, K, Snew , ISnew , j-1)
}
minSIP(C, K, S, IS , j-1)
}
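The Append(·) routine only has to maintain an antichain of sets for each index set; a small Python sketch (our own naming, with an extra duplicate guard) of that bookkeeping:

    def append(S_old, S_new):
        """Append(T, S_old, S_new) for one index set: drop every stored set that
        properly contains S_new, then add S_new unless some stored set is already
        contained in (or equal to) it.  Sketch only."""
        kept = [S for S in S_old if not (S > S_new)]      # remove proper supersets of S_new
        if not any(S <= S_new for S in kept):             # S_new dominated by an existing set?
            kept.append(S_new)
        return kept

    print(append([{'a', 'b', 'c'}, {'d'}], {'a', 'b'}))
    # keeps {'d'} and adds {'a', 'b'}; {'a', 'b', 'c'} is discarded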

13.7.2 Minimal from maximal sets


In this section we discuss how to extract the minimal sets while computing
the maximal sets. This is best explained through a concrete example. See
Figure 13.15 for a portion of a search trie. This tree is not in reduced form,
i.e., it retains the internal nodes that have only one child. In other words,
every node of this tree is labeled by a single character.
[Figure: a search trie rooted at ({e}, {1,2,3,4,5,6}) with singleton edge labels. One branch runs through ({e,b}, {1,2,3,5,6}) and ({e,b,a}, {1,2,3,5,6}) and then splits into ({e,b,a,c}, {1,3,5}) and ({e,b,a,d}, {2,3,5}); the other branch runs through ({e,f}, {1,2,3,4,6}) and ({e,f,h}, {1,2,3,4,6}) to ({e,f,h,g}, {1,2,3,4,6}).]

FIGURE 13.15: A search trie with singleton labels on the edges. The solid
nodes represent maximal set pairs. The hollow nodes with a pointing arrow
represent minimal set pairs.

Let the label of a node A be the pair

(SA , ISA ).

The following observation follows directly from the construction of the trie.

LEMMA 13.9
(Trie partial-order lemma) Node B is a descendent of node A in the trie,
if and only if the following two statements hold:

SB ⊃ SA and
ISB ⊆ ISA .

Then the following is straightforward to see and we leave the proof as an


exercise for the reader (Exercise 171).

1. Node A denotes a maximal intersection pair (SA , ISA ) if and only if

A has more than one child.

2. Node A denotes a minimal intersection pair (SA , ISA ) if and only if

the parent of A has multiple children and


A has a single child.
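The two conditions above translate directly into a single traversal of the singleton-labeled trie. The following Python sketch, with our own node representation (a map from node id to its pair and its children), reports the maximal and the minimal pairs in one pass.

    def maximal_and_minimal(nodes, root):
        """nodes: {node_id: (pair, [child ids])} for a trie with singleton labels.
        A node is maximal when it has more than one child; minimal when it has a
        single child and its parent has multiple children.  (Sketch only.)"""
        maximal, minimal = [], []

        def walk(u, parent_branches):
            pair, children = nodes[u]
            if len(children) > 1:
                maximal.append(pair)
            elif len(children) == 1 and parent_branches:
                minimal.append(pair)
            for v in children:
                walk(v, len(children) > 1)

        walk(root, False)
        return maximal, minimal

    nodes = {                                       # a small toy trie
        0: ((set(), {1, 2, 3}), [1]),
        1: (({'e'}, {1, 2, 3}), [2, 3]),
        2: (({'e', 'b'}, {1, 2}), [4]),
        3: (({'e', 'f'}, {3}), []),
        4: (({'e', 'b', 'a'}, {1, 2}), []),
    }
    print(maximal_and_minimal(nodes, 0))
    # maximal: ({'e'}, {1, 2, 3});  minimal: ({'e', 'b'}, {1, 2})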

13.8 Multi-Sets
We next consider multi-sets, where the multiplicity of the elements (some-
times also called copy number) is taken into account. For example a multi-set
is given as:
S = {σ(k) | σ ∈ Σ, k ≥ 1}.
In other words, each element σ also has a copy number stating the number
of times σ appears in the set. The sets considered in Section 13.6 were such
that k = 1 for each σ. For example, if
S = {a(2), b(4)},
then multi-set S has two copies of a and four copies of b. Let
Σ′ = {σ(c) | σ ∈ Σ and c is a copy number in the data}.

Set operations of multi-sets. Given multi-set S, let the closure of S be


defined as follows:
cls(S) = S ∪ {σ(0) | σ ∈ Σ and σ(-) ∉ S}.
1. Given two multi-sets S1 and S2 ,
S1 ⊂ S2 ⇔ for each σ(k) ∈ cls(S1 ), there exists σ(k′ ≥ k) ∈ S2 .

2. Given multi-sets S1 , S2 , . . . , Sp ,
S1 ∩ S2 ∩ . . . ∩ Sp = {σ(kmin) | kmin = min_{1≤i≤p}(ki), where σ(ki) ∈ Si}.

The following are some illustrative examples. Here Σ = {a, b, c}.


{a(2), b(4)} = {b(4), a(2)},
{a(2), b(2)} ⊂ {b(2), a(3)},
{a(2), b(2)} ⊂ {b(2), a(3), c(1)},
{a(2), b(2)} ⊄ {b(2), c(1)}, and
{b(2), c(1)} ⊄ {a(2), b(2)}.
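These operations map directly onto Python's collections.Counter, whose & operator takes the element-wise minimum of copy numbers. The sketch below (our own function names) uses the three multi-sets of the running example introduced a few lines further on.

    from collections import Counter

    def msubset(S1, S2):
        # S1 ⊂ S2 for multi-sets: every copy number in S1 is matched or exceeded in S2
        return all(S2[sigma] >= k for sigma, k in S1.items())

    def mintersect(*sets):
        # element-wise minimum copy number; Counter's & operator does exactly this
        result = sets[0].copy()
        for S in sets[1:]:
            result &= S
        return result

    S1 = Counter({'a': 2, 'b': 6})
    S2 = Counter({'a': 1, 'b': 6, 'd': 2})
    S3 = Counter({'a': 3, 'b': 2, 'd': 2})
    print(mintersect(S1, S2, S3))                       # {a(1), b(2)}
    print(msubset(Counter({'a': 2, 'b': 2}), S3))       # True: {a(2), b(2)} ⊂ {a(3), b(2), d(2)}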
The maximal multi-set intersection problem is exactly along the lines of
Problem (24) and is stated below.

Problem 26 (Maximal Multi-Set Intersection Problem (maxMIP(S,K))) The


input to the problem is a collection of n multi-sets S = {C1 , C2 , . . . , Cn }, and
a quorum K > 0. For a multi-set S such that
S = C i1 ∩ C i2 ∩ . . . ∩ C ip ,
we denote by IS the set of indices

    IS = {i1, i2, . . . , ip}.

Further, IS is maximal, i.e., there is no I′ with

    IS ⊊ I′,

such that

    S = Cj1 ∩ Cj2 ∩ . . . ∩ Cjp′ , where I′ = {j1, j2, . . . , jp′}.
The output is the set of all pairs (S, IS ) such that |IS | ≥ K.
For example, consider three multi-sets with Σ = {a, b, d}:
S1 = {a(2), b(6)},
S2 = {a(1), b(6), d(2)}, and
S3 = {a(3), b(2), d(2)}.

Let quorum K = 2. What are the maximal multi-sets? Using the problem
specification, the maximal intersection multi-sets are given below.
S1 = {a(1), b(2)} with IS1 = {1, 2, 3},
S2 = {a(1), b(6)} with IS2 = {1, 2},
S3 = {a(2), b(2)} with IS3 = {1, 3}, and
S4 = {a(1), b(2), d(2)} with IS4 = {2, 3}.
We use this as the running example in the remainder of the discussion.

13.8.1 Ordered enumeration trie of multi-sets


Renaming scheme. Can we treat each symbol σ with its copy number c,
say σ(c) as a new symbol, say σc ? For a concrete example let a symbol a
have two different copy numbers say 3 and 7. Consider the following simple
example:
C1 = {a(3)} and C2 = {a(7)}.
Now if we replace a with two new symbols a3 and a7 , then the new sets are
C1′ = {a3 } and C2′ = {a7 },
with
C1′ ∩ C2′ = ∅ but C1 ∩ C2 = {a(3)}.
Thus, the original input sets C1 and C2 have one intersection set given as:
S = {a(3)} with IS = {1, 2}.
Clearly this renaming scheme does not work. We must take the ‘interactions’
of a(3) and a(7) into account.
[Figure: the four multi-sets S1 = {a(1), b(2)}, S2 = {a(1), b(6)}, S3 = {a(2), b(2)}, S4 = {a(1), b(2), d(2)}; the corresponding sequences seq(S1) = a(1) b(2), seq(S2) = a(1) b(6), seq(S3) = a(2) b(2), seq(S4) = a(1) b(2) d(2); and the trie built from these sequences.
(a) a < b < d and the trie for S.]

FIGURE 13.16: Given four multi-sets S1 , S2 , S3 and S4 . The corre-


sponding collection of sequences S and the trie for S.

Trie of multi-sets. As for the other sets, we first define an ordering on the
elements of Σ. Let
Σ = {σ1 < σ2 < . . . < σm }.
Given a set
S = {σi1 (ci1 ), σi2 (ci2 ), . . . , σil (cil )},
with
σ i 1 < σ i 2 < . . . σi l ,
define a sequence, seq(S), as

seq(S) = σi1 (ci1 ) σi2 (ci2 ) . . . σil (cil ).

In the sequence, each element σi appears exactly once and is annotated with
the copy number ci . A trie of the sequences is a tree satisfying the following
properties.

1. Each sequence must be represented by a unique path from the root to


the leaf node. In the trie for the multi-set, if a symbol σ appears multiple
times with different copy numbers

cmin = c1 < c2 < . . . < cl = cmax ,

then we interpret that as a single symbol σ(cmax ).

2. No two siblings in the trie have a label of the form σ(-), i.e, the same
symbol σ.
[Figure: the complete binary ordered enumeration tree; along each path the left-child edges carry, in order, the labels a(1), a(2), a(3), b(2), b(6), d(2).]

FIGURE 13.17: Let Σ = {a < b < d} where the data shows these
copy numbers: 1, 2, 3 for a; 2, 6 for b; 2 for d. The complete binary ordered
enumeration tree for such an input is shown above. To avoid clutter, the
labels on the nodes are omitted.

The running example is discussed in Figure 13.16. This shows how the
sequence is ‘represented’ by the trie (encoded at the nodes). Each edge is
labeled by the element σ and its copy number. What is the size of the trie?
In fact, it is the same as before and Lemma (13.6) holds even for the multi-sets.

13.8.2 Enumeration algorithm


We again propose a simple ordered enumeration scheme along the lines of
Algorithm (14). See Figure 13.9 for the overall scheme.

The complete binary ordered tree. We define the Complete Binary Or-
dered Tree as before with a few additional properties due to the multiplicities.
For each σj ∈ Σ, let the jn copy numbers of σj be as

cj1 < cj2 < . . . < cjn .

1. As before, every internal node (including the root node) has exactly two
children. The edges are labeled as follows. The right-child (edge) is
unlabeled and is shown as a dashed edge in the figure. The left-child
(edge) is labeled as follows.

(a) The left child of the root node is labeled with σ1 (c11 ).
(b) Let an internal node v have an incoming edge labeled as σj (cjl ),
then its left child has the following label:

σj (cjl+1 ), if (l + 1) ≤ jn ,
σj+1 (cj1 ), otherwise.

2. Every node is labeled by the pair (S, IS ) as before. S is determined


exactly as in the trie for multi-sets.

See Figure 13.17 for the running example. The tree is pruned using exactly
the same terminating conditions as before and the pruned tree for the running
example is shown in Figure 13.18(a) and the corresponding collapsed tree, or
the search tree, is shown in Figure 13.18(b). Figure 13.19 shows the search
tree for a different ordering of the elements of Σ.

Reorganizing the input. For efficiency purposes, we reorganize the input


sets. For each σj ∈ Σ, a ‘dictionary’ list, Dicσj , is built. Let the jn copy
numbers of σj be as

cj1 < cj2 < . . . < cjn .

This is best explained through an example. Continuing the running example,


the dictionary lists are as follows:

    Dica : 2 → 1 → 3        (sublists start at copy numbers 1, 2, 3)
    Dicb : 3 → 1 → 2        (sublists start at copy numbers 2, 6)
    Dicd : 2 → 3            (sublist starts at copy number 2)

The sublist corresponding to cjr is denoted as Dicσj (cjr ) . Note that each
sublist includes all elements to the right, thus with an abuse of notation if
Dicσ(c) denotes the set of the elements in the list, then

Dica(1) = {2, 1, 3},
Dica(2) = {1, 3},
Dica(3) = {3},

with Dica(1) ⊋ Dica(2) ⊋ Dica(3).


For convenience, we maintain this as a single list and the algorithm uses
the three sublists in three consecutive recursive calls (unless the calls are
terminated by other conditions). The pseudocode for the complete ordered
enumeration is given in Algorithm (16). This follows exactly along the lines
of Algorithm (14). The major difference in the algorithms is in the use of
copy numbers along with the symbols. These statements are shown boxed
in the code. This code is geared towards using the sorted list Dic described
above.
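Before turning to the pseudocode, a small Python sketch (our own naming) of how such dictionary lists and their sublist boundaries might be assembled:

    from collections import Counter

    def build_dictionaries(C):
        """For each symbol, the (1-based) indices of the input multi-sets sorted by
        increasing copy number of that symbol, plus the position at which each
        distinct copy number's sublist begins.  (Illustrative sketch.)"""
        dic = {}
        for sigma in sorted(set().union(*[set(Ci) for Ci in C])):
            entries = sorted((Ci[sigma], i + 1) for i, Ci in enumerate(C) if Ci[sigma] > 0)
            order = [i for _, i in entries]
            starts = {}
            for pos, (c, _) in enumerate(entries):
                starts.setdefault(c, pos)            # first position holding copy number c
            dic[sigma] = (order, starts)
        return dic

    C = [Counter({'a': 2, 'b': 6}),
         Counter({'a': 1, 'b': 6, 'd': 2}),
         Counter({'a': 3, 'b': 2, 'd': 2})]
    print(build_dictionaries(C)['a'])                # ([2, 1, 3], {1: 0, 2: 1, 3: 2})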

Algorithm 16 (Maximal Multi-Set Intersection Problem (MIP))

//For each σj ∈ Σ,
//let the jn copy numbers of σj be sorted as:
// cj1 < cj2 < . . . < cjn .

maxMIP(C, K, S, IS , j, r)
{
IF j > 0 AND |IS | ≥ K
ISnew ← {i | i ∈ IS AND i ∈ Dicσj (cjr ) }
IF (|ISnew | ≥ K) {
Snew ← S ∪ {σj (cjr )}
Terminate ← FALSE
IF (ISold = Exists(T , ISnew ))
IF |ISnew | = |IS | //immediate parent;
//S is possibly maximal
Replace(T , Sold , Snew )
ELSE Terminate ← TRUE //S is nonmaximal,
//hence terminate this branch
ELSE Add(T , Snew )
IF NOT Terminate

j ′ ← (r = jn )? (j + 1) : j    //advance to Dicσj+1 when the Dicσj list is exhausted

r ′ ← (r = jn )? 1 : (r + 1)    //otherwise advance within the Dicσj list

maxMIP(C, K, Snew , ISnew , j ′ , r ′ ) //left-child call


}

OUTPUT(S, IS ) //S is certainly maximal

IF |ISnew | ≠ |IS | OR j = 1 //right & left subtrees the same


//OR very last right-child call
maxMIP(C, K, S, IS , j-1,1) //right-child call
}

13.9 Adapting the Enumeration Scheme


The enumeration scheme presented here is a depth first traversal of the
trie. It is the power of such a scheme that a code of less than fifteen lines can
accurately encode the solution to a problem as complex as Problem (24).
Recall that the routine in Algorithm (14) is recursive, enabling the depth
first traversal of the implicit trie. It invokes many instances of itself (no
[Figure: (a) The pruned tree, rooted at (∅, {1,2,3}), with nodes ({a(1)}, {1,2,3}), ({a(2)}, {1,3}), ({a(1),b(2)}, {1,2,3}), ({a(2),b(2)}, {1,3}), ({a(1),b(6)}, {1,2}) and ({a(1),b(2),d(2)}, {2,3}).
(b) The collapsed pruned tree or the search tree, over the same node labels.]
FIGURE 13.18: The complete binary tree of Figure 13.17 has been
pruned by using the different terminating conditions in the routine. This
tree has been further collapsed to obtain the search tree. The maximal sets
are shown as solid circles.
[Figure: the search tree for the order b < a < d, rooted at (∅, {1,2,3}), with nodes ({b(2)}, {1,2,3}), ({b(6)}, {1,2}), ({a(1),b(2)}, {1,2,3}), ({a(1),b(6)}, {1,2}), ({a(2),b(2)}, {1,3}) and ({a(1),b(2),d(2)}, {2,3}).]

FIGURE 13.19: The ordered enumeration trie with b < a < d. The nodes
with maximal pairs are shown as solid circles.

more than two at each call). At some time point, these instances, due to
the recursive nature of the process, are partially executed and are waiting for
other instances to complete their execution before they can complete their
own. For very large data sets, the number of such partially executed routines
may pile up, eventually running out of space in the computer memory.
Is the depth-first order of traversal really crucial? There are at least two
implications of this order of traversal.

1. Run time efficiency: The order ensures that a maximal set has been seen
before its nonmaximal versions. This results in an effective pruning of
the trie due to termination conditions (2) of Section 13.6. The only
exception is when the intersection set S is being built, one element σ at
a time.

2. OUTPUT statement in the algorithm description: This order of traver-


sal guarantees that the routine can spit out the maximal sets, without
ever having to backtrack.

It is always possible to rewrite the depth first traversal code as a nonrecur-


sive procedure. This code maintains the trie explicitly. It is also possible to
carry out a breadth first traversal, at the cost of a less time efficient algorithm.
We leave these as exercises for the reader (Exercise 169).
These are two systematic ways of dealing with run-time space issues for very
large data sets. Sometimes in practice, two sets, S1 and S2 , are deemed equal
(S1 ≈ S2 ), if they have a large number of common elements. Quantitatively,

    (S1 ≈ S2) ⇔ |S1 ∩ S2| / |S1 ∪ S2| ≥ δ

for some fixed 0 < δ ≤ 1. There is no guarantee that such an assumption
will reduce the number of solutions, but it may make the problem tractable in
some domains. Of course, the question of whether the algorithm reports all
solutions that fit the definition must be addressed.
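In code, the δ test above is a one-line Jaccard-style similarity check; a sketch:

    def approx_equal(S1, S2, delta):
        # (S1 ≈ S2)  ⇔  |S1 ∩ S2| / |S1 ∪ S2| >= delta
        return len(S1 & S2) / len(S1 | S2) >= delta

    print(approx_equal({'a', 'b', 'c'}, {'a', 'b', 'd'}, 0.5))   # True  (2/4 = 0.5)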

13.10 Exercises
Exercise 157 (Unique transitive reduction lemma)

1. Show that Er (of Section 13.3) is the smallest set of edges such that

C(G(S, E)) = C(G(S, Er )).

2. If E ′ is another set of edges with

|Er | = |E ′ |, and
C(G(S, Er )) = C(G(S, E ′)),

then show that


E ′ = Er .

Hint: Use proof by contradiction.

Exercise 158 (Size of reduced partial order) Consider a reduced partial


order graph G(S, Er ). Let |S| = n. In the figures below, two new terminating
nodes (shown as solid circles), a 'start' node and an 'end' node are used
for convenience.

1. The partial order is called a total order if for any two nodes S1 , S2 ∈ S
there is

(a) a path from S1 to S2 or


(b) a path from S2 to S1 .

In this case,
|Er | = n + 1,
as shown below. However, without the terminating nodes, the number of
edges is n − 1.

[Figure: the start node, a chain of n nodes, and the end node.]

2. The partial order is called an empty order if for any two nodes S1 , S2 ∈
S, there is

(a) neither a path from S1 to S2


(b) nor a path from S2 to S1 .

In this case,
|Er | = 2n,
as shown below. However, without the terminating nodes, the number of
edges is zero.

In the worst case, how many edges can a reduced partial order with n nodes
have?

Hint: Consider the following example. Is it a reduced partial order? How


many edges does the graph have?

[Figure: two columns of n/2 nodes each; each node in the first column is connected to every node in the second column except one.]

Let
Σ = {σ1 , σ2 , . . . , σn/2 }.
Let a node in the first column be S1,i and in the second column be S2,i , then
for i = 1, 2, . . . , n/2,

S1,i = {σi },
S2,i = Σ \ {σi }.

Exercise 159 (Incomparable nodes) Two nodes S1 , S2 ∈ S in the reduced


partial order G(S, Er ) are incomparable, if both of the following conditions
hold:

there is no path from S1 to S2 , and

there is no path from S2 to S1 .

1. In Figure 13.1(a), are the nodes {a, b} and {b, c} incomparable?


Why?

2. Show that any pair of siblings in a reduced partial order graph are in-
comparable.

Hint: (2) Does the statement hold when the partial order is not reduced?

Exercise 160 (Straddle graph) Given a reduced partial order G(S, Er ),


let
Sm ⊂ S
be the set of nodes that have at least one child with multiple parents. Then
show that Sm can be partitioned as

S m = S 1 ∪ S2 ∪ . . . ∪ Sh ,

where each partition Si , 1 ≤ i ≤ h induces a connected straddle graph.

Hint: If S, S ′ ∈ VS , then show that

VS = VS ′ and ES = ES ′ .

Exercise 161 (Boolean closure) Let S be a collection of sets defined on


alphabet Σ such that no two sets in S straddle.

1. Then show that the boolean closure, B(S), is the same as S, i.e.,

S = B(S) = B∩ (S) = B∪ (S).

2. Then show that the reduced partial order graph

G(S, Er ) = G(B(S), Er ) = G(B∩ (S), Er ) = G(B∪ (S), Er )

is acyclic.

3. Show that the elements of Σ can be consecutively arranged as a string


say s, such that for each S ∈ S, its members appear consecutive in s.
Hint: 1. What is S1 ∪ S2 or S1 ∩ S2 for any S1 , S2 ∈ S?
2. Does any node have multiple parents? Why?
3. Use the fact that the reduced partial order graph is a tree.

Exercise 162 (Frontiers) Enumerate F(T ) for the PQ tree, T , shown be-
low.

[Figure: a PQ tree T whose leaves, from left to right, are a b c d e f g.]
What is |F(T )|, for the given T ?

Exercise 163 For a tree T ,


1. whose leaves are labeled bijectively with the elements of Σ and
2. each internal node represents a unique set, S ∈ S, which is the collection
of the labels of the leaf nodes reachable from this node,
show that a consecutive arrangement of Σ that respects S is the frontier F (T ).
Hint: Show that either the sets are disjoint or one is contained in the other.
Thus this nested containment has a linear representation.

Exercise 164 (Linear arrangement) Consider the following reduced partial
order of a collection of sets. Can the sets be arranged along a line? Why?

Exercise 165 (Set linear arrangement theorem) Consider Theorem (13.1)


and its proof presented in Section 13.1.
1. Prove that a consecutive arrangement sS of Equation (13.4) is always
possible.
2. Prove that a consensus arrangement exists when each straddle graph is
a chain.
Hint: 1. Recall that the graph is a reduced partial order of the boolean
closure of S. Finally, use proof by contradiction.
2. If there is only one nonsingleton straddle graph, then we are done. Assume
there are at least two distinct nonsingleton straddle graphs, say G(V1 , E1 ) and
G(V2 , E2 ). Let
σp ∈ Sij , σq ∈ Sik and σr ∈ Siℓ ,
where Sij , Sik , Siℓ ∈ V1 and j < k < ℓ. Next let
σp ∈ Sij ′ , σq ∈ Sik′ and σr ∈ Siℓ′ ,
where Sij ′ , Sik′ , Siℓ′ ∈ V2 and

neither j ′ < k′ < ℓ′ nor ℓ′ < k′ < j ′ holds. (13.10)


If j = k or k = ℓ, we are done. Also, if j ′ = k′ or k′ = ℓ′ , we are done. Show
that
(Sit ∈ V1 ) ⊂ (Sit′ ∈ V2 ), for t = j, k, ℓ.
and the assumption (13.10) must be wrong.

Exercise 166 (Local graph structures) Some properties of a graph can


be determined by studying its local structures. Consider two such structures
shown below for nodes with multiple parents in a reduced partial order of the
boolean closure of S:
G(B(S), Er ).

(a) Forbidden structure. (b) Mandatory structure


(siblings, indicated by dashed edges,
possibly empty).
1. Show that if G(B(S), Er ) has a forbidden structure, then
F(S) = ∅,
i.e., the sets cannot be consecutively arranged.

2. If every node with two parents has the mandatory structure, then is

F(S) ≠ ∅,

i.e., can the sets be consecutively arranged?


3. Is it possible to conclude if

F(S) = ∅,

by studying local properties of

G(B(S), Er ).

Hint: (1) The forbidden structure shows nodes with more than two parents.
Note that the graph is a reduced partial order of the boolean closure. (2)
Consider the example shown below. Does each node with multiple parents,
respect the mandatory structure? What is F(S)?
a,b,c

a,b b,c a,c

a b c

Exercise 167 (Minimal set intersection algorithm)


1. What is the running time complexity of Algorithm (15), the Minimal Set
Intersection Algorithm?
2. How does this compare with the running time complexity of the Maximal
Set Intersection Algorithm (14)?
Hint: Is it better to compute the minimal from the maximal sets?

Exercise 168 (Maximal intersection trie enumeration) Consider Prob-


lem (24), the maximal set intersection problem.
1. Can the search tree be made any more conservative? In other words, is
there room to improve the time complexity of Algorithm (14) by intro-
ducing more terminating conditions? Why?
2. If
m >> n,
i.e., the alphabet is much larger than the number of sets, how can Algo-
rithm (14) be modified to retain an efficient run time complexity?

3. If
|Σ| = O(NI ),
how can the enumeration be changed to exploit this fact?

Exercise 169 (Maximal intersection trie enumeration) Consider Al-


gorithm (14).
1. How are the results affected if the order of the recursive calls marked
‘left-child call’ and ‘right-child call’ are switched?
2. Show that by changing the order of alphabet Algorithm (14) gives the
same maximal sets but possibly different search tree (or tries).
3. Re-write Algorithm (14) as a nonrecursive procedure.
4. Re-write Algorithm (14) as a nonrecursive procedure that traverses the
complete binary ordered tree BC in a breadth first order.
Hint: 3. Build the trie TC,K as an explicit structure and traverse it in a
depth first order.
4. Build BC as an explicit structure and traverse it in a breadth first order.
Also, identify the tree pruning conditions to make the procedure efficient.

Exercise 170 (Maximal multi-set intersection trie enumeration)


1. Let σ ∈ Σ and consider sets, Ci , 1 ≤ i ≤ 8, with

σ 6∈ C1 , σ(c2 ) ∈ C2 , σ(c1 ) ∈ C3 , σ(c2 ) ∈ C4 ,


σ(c2 ) ∈ C5 , σ(c3 ) ∈ C6 , σ(c1 ) ∈ C7 , σ(c3 ) ∈ C8 .

The list is sorted as follows for Algorithm (16).

c1 c2 c3
↓ ↓ ↓
Dicσ→ 3 → 7 → 2 → 4 → 5 → 6 → 8 –|

(a) What does the list assume about the ordering of the elements

c 1 , c2 , c3 ?

(b) What is important about this ordering? In other words, how is


Algorithm (16) affected if this ordering is not maintained in the
list?
2. Consider the pruned tree in Figure 13.18. Identify all the terminating
conditions in Algorithm (16) that bring about this pruning.

Exercise 171 (Maximal to minimal) Consider a search trie T where the


label of each node is a single character. Node A on this tree has the label
(SA , ISA ).

1. Show that the following two conditions on node A are equivalent.


Condition (1):

(a) the parent of A has multiple children and


(b) A has a single child.

Condition (2):

(a) Node B has multiple children and


(b) A is the closest ascendant of B such that
i. A has a single child and
ii. the parent (immediate ascendant) of A has multiple children.

2. Node A denotes a minimal intersection pair

(SA , ISA )

if and only if any one of conditions (1) and (2) hold.

Hint: For a fixed index set I ′ , how many nodes in the tree have label (S, IS )
with I ′ = IS ? How are these nodes arranged on the tree? Use Lemma (13.9).

Exercise 172 (Multi-set intersection problem)

1. Assume the input is a collection of multi-sets. The ordered enumeration


trie for two different orders

a < b < d,
b < a < d,

shown in Figures 13.18(b) and (13.19) respectively, are isomorphic to


each other.
Does this always hold? Why?

2. Identify all the terminating conditions, marked by Roman numerals I,


II, . . . VIII, in the figure below for the example of Section 13.8.

[Figure: the pruned tree of the example of Section 13.8, with its terminating calls marked by the Roman numerals I, II, . . . , VIII. The labeled nodes include ({a(1)}, {1,2,3}), ({a(2)}, {1,3}), ({a(1),b(2)}, {1,2,3}), ({a(2),b(2)}, {1,3}), ({a(1),b(6)}, {1,2}) and ({a(1),b(2),d(2)}, {2,3}).]

Exercise 173 (Maximal string intersection) Consider an input of m se-


quences si , 1 ≤ i ≤ m on some alphabet Σ. Then the pair (p, Lp ), p ⊂ Σ and
Lp ⊂ {1, 2, . . . n}, is a maximal intersection when

p = {σ | i ∈ Lp and there is some k such that si [k] = σ},

and p is maximal when there exists no distinct p′ ⊃ p with

Lp′ = Lp .

Algorithm 17 Maximal String Intersection Algorithm (maxStIP)


Dicσj = {i | si [k] = σj for some k} //dictionary
Lp ← {1, 2, . . . , m}
p←∅
maxStIP(K, σ1 , Lp , p) //main call

maxStIP(K, σj , Lp , p)
IF (j ≤ |Σ|) AND (|Lp | ≥ K)
Lpsav ← Lp , psav ← p
FOR EACH i ∈ Lpsav
IF i ∉ Dicσj Lp ← Lp \ {i}
p ← p ∪ {σj }
Quit ← ((|Lp | < K) OR (p ⊂ ExistPat(Lp )))
IF NOT Quit
StorePat(Lp , p) //new or updated motif
maxStIP(K, σj+1 , Lp , p) //with σj
IF |Lp | < |Lpsav | //only if the two are distinct
maxStIP(K, σj+1 , Lpsav , psav ) //ignoring σj

1. For each input string si (1 ≤ i ≤ m), construct a set as follows


Ci = {σ | si [k] = σ for some k}
Further let
C = {C1 , C2 , . . . , Cm }.
Does this algorithm produce the same results on the strings as Algo-
rithm (14) on C?
2. Identify the ‘left’ and the ‘right’ child calls.
3. Identify the recursive call terminating conditions.
4. Outline the steps in StorePat and ExistPat.
5. Compare this algorithm with Algorithm (14). Which is more efficient?
Why? Is there an asymptotic improvement in one algorithm over the
other?

Exercise 174 (Tree data structure) For an input C and quorum K, what
is the relationship between the trie
TC,K
and the data structure T to store the pairs (S, IS )?
Hint: Consider the reduced trie where each internal node has at least two
children. Can the trie be balanced?

Comments
The material in this chapter is fairly straightforward. At first blush, it even
seems like it does not deserve the dignity of a dedicated chapter. However, I
have seen the very same problem pop up in so many hues and shapes that per-
haps an unobstructed treatment of the material, within its very own chapter,
will do more good than harm.
Chapter 14
Expression & Partial Order Motifs

It is better to understand some of the questions,


than to know all of the answers.
- adapted from James Thurber

14.1 Introduction
Consider the task of capturing the commonality across data sequences in a
variety of scenarios. Depending on the data and the domain, the questions
change.
1. Total order: Segments appear exactly at each occurrence and these are
called the string patterns.
Certain wild cards may be allowed, or even flexible extensions of the gap
regions (such motifs are called extensible motifs).
2. No order (but proximity): If groups of elements appear together, even
if they respect no order, these clusters may be of interest. These are
called permutation patterns.
Again, they may show some substructures of proximity within them (as
PQ structures).
3. Partial order: Is it possible that key players are only partially ordered?
The key players themselves could be as simple as motifs or clusters or
as complex as a boolean expression.
Further, if the input is organized as a sequence and this order must be
important, these can be modeled as the mathematical structure partial
order.
Of course, a more general order is defined as a graph and topological mo-
tifs can be discovered from this organization of data possibly providing
some insight into the process that produced this data.
As the landscape changes, the questions change and so do the answers. There
is an interesting interplay of different ideas such as permutations of motifs or
extensible motifs of permutations or partial orders of expressions and so on.


This chapter focuses on boolean expressions (on possibly string motifs);


partial orders of motifs and finally, partial orders of expressions on motifs.
We motivate the reader with a brief example below and details of the main
ideas are discussed in the subsequent sections.

14.1.1 Motivation
In the following, mini-motifs refer to string motifs. Consider the results
obtained by mining patterns in binding site layouts in four yeast species as
studied in Kellis et al. [KPE+ 03]: S. cerevisiae, S. paradoxus, S. mikatae, and
S. bayanus.
Out of 45,760 mini-motifs, some 2419 significantly conserved mini-motifs
are grouped into 72 consensus motifs. 1 For the small fraction of sequences
where some motifs occur more than once, only the position closest to the
TATA box is utilized. Many of these motifs correspond to known transcription
factor binding sites [ZZ99] whereas others are new and putative, supported
indirectly by co-expression or functional category enrichment.
We use the number id’s for the motifs and show an example below. Is there
more structure than just the cluster?

37 ∧ 66 ∧ 5

The symbol ∧ denotes ‘and’ and ∨ denotes ‘or’. Notice that motif 37 always
precedes motif 66, but motif 5 could be in any relative position, in each
of the clusters. This is captured in the partial order shown to the right
below. Symbols S and E are meta symbols that denote the left and right ends
respectively.

(1) Input sequence data:
    Spar (YDR034C-A): 37 66 5
    Spar (YJL008C):   5 37 66
    Spar (YJL007C):   5 37 66
    Spar (YMR083W):   37 66 5
    Spar (YOR377W):   37 5 66
(2) Partial order motif: S → 37 → 66 → E, with 5 on its own branch between S and E.

Another example with the cluster

48 ∧ 55 ∧ 37 ∧ 5 ∧ 24

is shown below. Here motif 48 always precede motif 55 and motif 37 always
precedes motif 24. The common ordering of the elements is captured in the

1 See [KPE+ 03] for more details.



partial order shown on the right.


(1) Input sequence data:
    Spar (YDR034C-A): 48 37 24 55 5
    Spar (YJL008C):   5 37 48 55 24
    Spar (YJL007C):   5 37 48 55 24
    Spar (YMR083W):   24 55 37 48 5
    Spar (YOR377W):   48 55 37 5 24
(2) Partial order motif: three branches between S and E, namely 48 → 55, 5, and 37 → 24.
Thus, informally speaking, a partial order motif is a decorated cluster.
Partial order of boolean expressions: As a further generalization, each node
in the partial order is a boolean expression of multiple elements. The final
example here illustrates the use of disjunctions in the motif expressions. The
common expressions are as follows:
24 ∧ 68 ∧ 19 ∧ (50 ∨ 55 ∨ 39)
24 ∧ 68 ∧ 19 ∧ (40 ∨ 54)
The partial ordering of these expressions is shown below (S, E, S1, E1 are
meta symbols). Note the use of 'OR' in the boolean expressions, which is shown
as '|' in the nodes of the partial order.

[Figure: the two partial order motifs for the expressions above, drawn over the nodes 24, 68, 19, 50|55|39 and 24, 68, 19, 40|54 respectively, with the meta symbols S and E (and S1, E1 in the first).]

The genes and motifs in this example are enriched for multiple stress response
pathways (e.g., HSPs, Ras) and sensing of extracellular amino acids.

14.2 Extracting (Monotone CNF) Boolean Expressions


Expression mining exhibits traits of conceptual clustering, constructive in-
duction, and logical formula discovery [Fis87, Mic80]. Inducing understand-
able definitions of classes using a set of features [VPPP00] is a goal common
to all descriptive classification applications.
The given input of n sequences (S) on the set of m motifs (F) is modeled
as an (n × m) boolean incidence matrix I, without order information, as:

    I[i, j] = 1 if motif mj occurs in sequence si ∈ S, and 0 otherwise.

In this section, a motif is treated as a boolean variable. Thus a boolean


expression of the form
e = m1 ∧ m2 ∨ m3

implies that either both m1 and m2 occur or m3 occurs in a sequence for which
this expression e holds. Alternatively, the same expression can be written as

e = m1 m2 + m3 .

The negation of a variable, m, is written as m̄.

DEFINITION 14.1 (expression e, features Π(e), objects O(e)) e is a


boolean expression on a set of motifs

V ⊆ F.

Given e, we denote the set of motifs involved in e by

Π(e)

and the set of sequences it represents by O(e).

We also use the following intuitive convention:

O(e) = O(e = m1 ∧ m2 ∨ m3 )
= O(m1 ) ∩ O(m2 ) ∪ O(m3 )
= O(m1 m2 + m3 ).

For simplicity of notation,

e = m1 ∧ m2 ∨ m3
= m1 ∩ m2 ∪ m3
= m1 m2 + m3 .

The set difference is written as

O(m1 ) \ O(m2 )

or simply
m1 \ m2 .
Two expressions e1 and e2 defined over V1 and V2 respectively are distinct
(denoted as e1 ≠ e2), if one of the following holds:

(i) V1 ≠ V2, or

(ii) there exists some input I for which O(e1) ≠ O(e2).


[Figure 14.1 table: the 4 × 2 incidence matrix I, whose rows are (0,0), (0,1), (1,0) and (1,1), together with, for every subset O(e) ⊆ {1,2,3,4}, an expression e on the variables m1 and m2 (negated variables included) having exactly that support, and the set Π(e) of motifs involved; the monotone expressions among them are marked with √.]

FIGURE 14.1: An incidence matrix I and all possible expressions, e, defined on the variables m1 and m2. The monotone expressions are marked with √.

Notice that this condition rules out tautologies. For example, using the set
notation, the expressions
m1 ∩ m4
and
m1 \ (m1 \ m4 )
are not distinct.
An expression e is in conjunctive normal form (CNF) if it is a conjunction
of clauses, where a clause is a disjunction of literals. For example,

e = (m1 + m2 )(m3 + m4 )m5 m6 (m1 + m4 )

is in CNF form.
An expression e is in disjunctive normal form (DNF) if it is a disjunction
of clauses, where a clause is a conjunction of literals. For example,

e = m1 m2 + m3 m4 + m5 + m6 + m1 m4

is in DNF form.
It is straightforward to prove that any expression e can be written either
in a CNF or a DNF form and we leave this as an exercise for the reader
(Exercise 176).
A boolean expression is very powerful since it has the capability of ex-
pressing very complex interrelationships between the variables (motifs in this
case). See Figure 14.1 for an example. However, it is this very same power
that renders it ineffective: it can be shown that there always exists a boolean
expression that precisely represents any collection of rows in any incidence
matrix I. See Exercise 175 and Figure 14.1 for an example.
So we focus on a particular subclass of boolean expression called monotone
expression [Bsh95]. This is a subclass of boolean expressions that uses no
negation. In other words, it uses only conjunctions and disjunctions. See the
marked expressions in the example of Figure 14.1.
However, an expression is called monotone because it displays monotonic
behavior (see Exercise 177), thus care needs to be taken to determine if an
expression e is monotone. For example consider

    e = m1 m2 + m̄1 m2,

that appears not to be monotone due to the presence of m̄1. However,

    e = m1 m2 + m̄1 m2
      = (m1 + m̄1) m2
      = 1 · m2
      = m2.

Thus expression e = m1 m2 + m̄1 m2 is indeed monotone.


Now we are ready to define the central task:

Given an incidence matrix I, and a quorum K, the task is to find all


monotone expressions e in the CNF form such that

|O(e)| ≥ K.

Note that since the expressions are restricted to be monotone, this specifica-
tion is nontrivial. We say it is nontrivial since it is possible that there exists
a collection of rows, V , of I such there exist no expression e with

O(e) = V.

In other words, the solution for the problem is not simply all K ′ -sized subsets
of the rows where
K ′ ≥ K.
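The support O(e) of a monotone CNF expression is cheap to evaluate against I. The following Python sketch, used only to make the specification concrete, represents e by our own convention: a list of clauses, each clause a list of column indices that are OR-ed together, with the clauses AND-ed; the small matrix is the one reused in the worked example at the end of this section.

    def support(I, cnf):
        """O(e) for a monotone CNF e over the 0/1 matrix I: a row is kept when
        every clause has at least one column equal to 1 on that row."""
        return {i + 1 for i, row in enumerate(I)            # 1-based row numbers
                if all(any(row[j] == 1 for j in clause) for clause in cnf)}

    I = [[1, 0, 1],
         [0, 1, 1],
         [0, 0, 1],
         [0, 0, 0],
         [1, 0, 1]]
    print(support(I, [[2], [0, 1]]))    # e = m3 (m1 + m2): {1, 2, 5}
    print(support(I, [[0], [2]]))       # e = m1 m3:        {1, 5}

A quorum K is then simply the condition len(support(I, cnf)) >= K.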

The algorithm to detect monotone expressions uses the intermediate notion


of biclusters.

14.2.1 Extracting biclusters


Given I, a bicluster is a nonempty collection of columns V and a nonempty
collection of rows O satisfying the following. For each j ∈ V , let a constant
cj be such that
O = {i | I[i, j] = cj , for each j ∈ V }.

Thus O can be determined uniquely by defining V . Note the similarity of


this form with a pattern p defined by the column values cj for each j ∈ V .
Then the location list Lp is simply O and thus a quorum constraint could
be imposed on the size of O. Again in practice, usually the interest is in
biclusters with |V | ≥ 2.
The bicluster is maximal (also sometimes called a closed itemset [Zak04])
if this pattern cannot be expanded with more columns without reducing the
number of rows in O. In other words, there does not exist

j′ ∉ V with I[i, j′] = cj′,

for each i ∈ O and some fixed cj ′ . These conditions define the ‘constant
columns’ type of biclusters. See [MO04] for different flavors of biclusters used
in the bioinformatics community.
The bicluster is minimal if for each j ∈ V , the collection of rows O and the
collection of columns
V \ {j}

is no longer a bicluster. In other words, the collection of rows can be expanded


for this smaller collection of columns.
[Figure: the 8 × 7 incidence matrix I over the columns m1, . . . , m7, and the matrix I′ obtained from it by highlighting the rows {1, 3, 5, 7} and the columns {m2, m4, m6}.]

A maximal bicluster: V1 = {m2, m4, m6}, O1 = {1, 3, 5, 7}, with the corresponding conjunctive form in I: e1 = m2 m4 m6, O(e1) = O1 = {1, 3, 5, 7}.

A minimal bicluster: V2 = {m2, m4}, O2 = {1, 3, 5, 7}, with the corresponding disjunctive form: e2 = m2 + m4, O(e2) = {2, 4, 6, 8}.

FIGURE 14.2: I is an incidence matrix. Let quorum K = 3. I ′ shows the


rows and the columns corresponding to the bicluster V1 , O1 .

Relationship between biclusters and expressions. Let V , O be a bi-


cluster and let expression e be the conjunction of the literals corresponding
to the columns in V . Then O is the same as the support
O(e) = {i | row i satisfies e}.
Thus conjunction of literals (or columns in I) correspond naturally to biclus-
ters.
In the spirit of irredundancy of Chapter 4, it can be argued that maximal
biclusters when the corresponding expression is a conjunction of literals, and
minimal biclusters when the expression is a disjunction, can be considered
irredundant. Note that all the other redundant expressions (any subset of the
columns of V for conjunctions and any superset of V for disjunctions) can be
trivially obtained from these in both the cases.
Thus it is meaningful to have maximal biclusters for conjunctions of literals
but minimal biclusters for disjunctions of literals. But, to compute disjunc-
tions from the incidence matrix I, we must resort to careful negations as
summarized in the following lemma.

LEMMA 14.1
(Flip lemma) Given an (n × m) incidence matrix I, for some 1 ≤ l ≤ m,

    e = f1 ∨ f2 ∨ . . . ∨ fl

is a minimal disjunction with support O(e) if and only if

    ē = f̄1 ∧ f̄2 ∧ . . . ∧ f̄l

is a minimal conjunction with O(ē) = {i | 1 ≤ i ≤ n AND i ∉ O(e)}.

Figure 14.2 shows an example of maximal and minimal biclusters and the
corresponding expressions. Note that Ī is defined as

    Ī[i, j] = 1 if I[i, j] = 0, and 0 if I[i, j] = 1.
The reader is directed to Chapter 13 for algorithms on finding maximal and
minimal biclusters, which is mapped to the problem of finding maximal and
minimal set intersections respectively. We leave the mapping of this con-
struction as Exercise 180 for the reader. Using Lemma (14.1), the mining of
monotone CNF expressions is staged as follows.
1. Find all minimal monotone disjunctions in I, by performing the following two substeps:

   (a) Find all minimal conjunctions on Ī.

   (b) Extract all minimal monotone disjunctions by negating each of these computed minimal conjunctions stored in T (see Lemma 14.1). For example, if the minimal conjunction is ē = f̄1 ∧ f̄2 (since Ī is used) then the minimal disjunction is e′ = f1 ∨ f2. Let the number of minimal disjunctions computed be d.

2. Copy matrix I to I′. Augment this new matrix I′ with the results of the last step as follows. For each minimal disjunction form e′, introduce a new column c in I′ with

       I′[i, c] = 1 if i ∈ O(e′), and 0 otherwise.

   The augmented matrix, I′, is then of size n × (m + d). Next, find all monotone conjunctions as maximal biclusters in I′.
A concrete example is shown below. To avoid clutter, Ī is not shown. However, ē = m̄1 m̄2 is detected as a minimal (conjunction) bicluster in Ī with support {3, 4}. Thus I′ has the new column m1 + m2 with support {1, 2, 5}. Some solutions on I′ are shown below.

        m1 m2 m3            m1 m2 m3 m1+m2        Some solutions:
    1    1  0  1        1    1  0  1    1         e1 = m1 m3,
    2    0  1  1        2    0  1  1    1         O(e1) = {1, 5}.
    3    0  0  1   ⇒    3    0  0  1    0
    4    0  0  0        4    0  0  0    0         e2 = m3 (m1 + m2),
    5    1  0  1        5    1  0  1    1         O(e2) = {1, 2, 5}.
           I                      I′
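A sketch of step 2, the augmentation, on the same data (Python; illustrative only, the minimal disjunctions themselves come from step 1):

    def augment(I, disjunctions):
        """Append one 0/1 column per minimal disjunction e'; row i gets a 1 exactly
        when e' holds on row i (a sketch of step 2 above)."""
        I_aug = [row[:] for row in I]
        for clause in disjunctions:                 # clause: column indices OR-ed together
            for i, row in enumerate(I_aug):
                row.append(1 if any(I[i][j] == 1 for j in clause) else 0)
        return I_aug

    I = [[1, 0, 1], [0, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 1]]
    for row in augment(I, [[0, 1]]):                # the single minimal disjunction m1 + m2
        print(row)
    # the appended column reads 1, 1, 0, 0, 1, i.e. support {1, 2, 5}, as above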

14.2.2 Extracting patterns in microarrays


The similarity of extracting patterns or biclusters from a real matrix M
(unlike the binary matrix I) with the task discussed in the last section is
startling, so we briefly digress here to compare and contrast the two.

Microarrays. A DNA microarray, also known as gene chip or a DNA chip,


is a matrix of microscopic DNA spots attached to a solid surface, such as glass
or silicon, for the purpose of monitoring expression levels of a large number of
genes simultaneously. This is called expression profiling. The affixed DNA
oligomers are known as probes.
Measuring gene expression using microarrays is relevant to many areas of
biology. For instance, microarrays can potentially be used to identify genes
causing diseases by comparing the expression profiles of disease affected and
normal cells.
For our purposes, the real matrix M is a direct conversion of the extent of
gene expression in position (i, j) which is the expression of gene numbered j
in the cell numbered i. However, it is important to keep in mind that the raw
data from the microarray must undergo some form of normalization before
subjecting it to any form of analysis. Literature abounds with proposed nor-
malization methods and a healthy debate continues over what an appropriate
model is.
Notwithstanding the unresolved debate, it is important to note that the
microarray technology provides an astounding increase in the dimensionality
of the data over a traditional experiment such as a clinical study. Such a
study may gather hundreds of data items for each patient. However, even a
medium-sized microarray has the capability and can obtain thousands of data
items (like gene expression) per sample. The need to extract ‘knowledge’ from
this data set is undoubtedly a challenging task.
In the discussion here we focus on the normalized real matrix M and the
task is to extract patterns seen in the arrays. A pattern is a bicluster V , O
where V is a collection of genes and O is a collection of samples. In other
words,

A bicluster (V , O) denotes the set of genes represented by V that show


similar expression in the samples represented by O.

Back to eliciting biclusters. Consider column j of the real n × m matrix


M . Assume that for this column (gene numbered j), two expression values,
v1 and v2 are considered to be equal if and only if, for some fixed δj ≥ 0,

|v1 − v2 | < δj .

Then using the technique, described in Section (6.6.2), we convert this to


a column defined on some alphabet Fj . Let the size of this alphabet be nj ,

then clearly,
nj = |Fj | ≤ n.
Let
Fj = {σj1 , σj2 , . . . , σjnj }.
Thus using some m fixed values
δ1 , δ2 , . . . , δm ,
the real matrix M is transformed to a matrix Q with discrete values, i.e.,
Q[i, j] ∈ Fj for each i and j.
Next, we stage the bicluster detection problem (or pattern extraction from
microarrays) as a maximal set intersection problem in the following steps.
1. For each column j and for each symbol σjk where 1 ≤ k ≤ jn , compute
the following sets of rows:
Sjσjk = {i | Q[i, j] = σjk }.
This gives mnj nonempty sets.
2. Invoke an instance of the maximal set intersection problem (of Section (13.6)) with the mnj nonempty sets and quorum K.
3. The solution of Step 2 is mapped back to the solution of the original
(bicluster) problem. This is a straightforward process and we leave the
details as an exercise for the reader (Exercise 180).
A simple concrete example is shown below. Let δj = 0.5, for 1 ≤ j ≤ 3.
Only nonsingleton sets are shown in Step 1. The two bicluster patterns are
marked in the input array at the bottom.

        g1    g2   g3              g1    g2   g3
    1   1.0   3.1  2.85        1   a     d    a          S1a = {1, 3, 4},   S3a = {1, 3},
    2   2.0   2.5  3.4    ⇒    2   b     c    b, c   ⇒   S1b = {2, 4},      S3b = {2, 3},
    3   1.25  1.9  3.1         3   a     b    a, b                          S3c = {2, 4}.
    4   1.5   0.7  3.7         4   a, b  a    c
            M                        Q                Maximal Set Intersection Problem

The first bicluster, S1a ∩ S3a = {1, 3}, marks rows 1 and 3 in columns g1 and g3;
the second, S1b ∩ S3c = {2, 4}, marks rows 2 and 4 in columns g1 and g3.
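Step 1 of the mapping is mechanical. A Python sketch on the discretized matrix Q of this example (our own representation: a cell may carry more than one symbol; the discretization itself is the one of Section (6.6.2) and is not reproduced here):

    Q = [                                    # the discretized matrix of the example
        [{'a'},      {'d'}, {'a'}],
        [{'b'},      {'c'}, {'b', 'c'}],
        [{'a'},      {'b'}, {'a', 'b'}],
        [{'a', 'b'}, {'a'}, {'c'}],
    ]

    def symbol_sets(Q):
        """For every column j and symbol s, the set of rows carrying s (Step 1);
        singleton sets are dropped since they cannot meet a quorum of 2."""
        sets = {}
        for i, row in enumerate(Q):
            for j, cell in enumerate(row):
                for s in cell:
                    sets.setdefault((j, s), set()).add(i + 1)     # 1-based rows
        return {k: v for k, v in sets.items() if len(v) > 1}

    print(symbol_sets(Q))
    # {(0,'a'): {1,3,4}, (0,'b'): {2,4}, (2,'a'): {1,3}, (2,'b'): {2,3}, (2,'c'): {2,4}}
    # Intersecting the g1-'a' and g3-'a' sets gives {1, 3}; g1-'b' and g3-'c' give {2, 4}.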
This method of detecting bicluster patterns has even been applied to protein
folding data in an attempt to understand the folding process at a higher
level [PZ05, ZPKM07].

14.3 Extracting Partial Orders


In this step, we restore the order information among motifs, for each mined
expression. We group disjunctions into a ‘meta-motif’ (as done in the example
in Section (14.1.1)) so that we can view all gene sequences as sequences of the
same length over the alphabet of meta-motifs. Before we detail the specifics,
we briefly review partial orders and related terminology.

14.3.1 Partial orders


Let F be a finite alphabet,

    F = {m1, m2, . . . , mL},

where L = |F|.
A binary relation B,
B ⊂ F × F,
(a subset of the Cartesian product of F ) is a partial order if it is
1. reflexive,
2. antisymmetric, and
3. transitive.
For any pair m1, m2 ∈ F,
m1 ⪯ m2 if and only if (m1, m2) ∈ B.
In other words, (m1, m2) ∉ B if and only if m1 ⋠ m2.
A string q is compatible with B, if for no pair m2 preceding m1 in q, m1 ⪯ m2
holds in B. In other words, the order of the elements in q does not violate the
precedence order encoded by B. A compatible q is also called an extension
of B. q is a complete extension of B, if q contains all the elements of the
alphabet F . Such a q is also called a permutation on F . Also,

P rm(F ) = {q | q is permutation on F },
Cex(B) = {q | q is a complete extension of B}.
2
P rm(F ) is the set of all possible permutations on F . Thus

Cex(B) ⊆ P rm(B).

2 If|F | = n, P rm(F ) can be related to Sn in combinatorics, which is the group of permu-


tations of {1, 2, . . . , n}.

B can also be represented by a directed graph G(F, E ′ ) where edge


(m1 m2) ∈ E′, if and only if m1 ⪯ m2.
Since B is antisymmetric, it is easy to see that G(F, E ′ ) is a directed acyclic
graph (DAG).
If E is the smallest set of edges such that when m1 ⪯ m2, there is a path
from m1 to m2 in E, then G(F, E) is called the transitive reduction of B.
The following property of the edge set E of a transitive reduction of a partial
order is easily verified (Exercise 182).

LEMMA 14.2
(Reduced-graph lemma) If G(F, E) is the transitive reduction of a partial
order, then for any pair, m1 , m2 ∈ F , if there is a directed path of length
larger than 1 from m1 to m2 in G(F, E), then edge (m1 m2) ∉ E.

In the following discussion, the transitive reduction of a partial order B


will be denoted by its DAG representation G(F, E). If the directed edge
(m1 , m2 ) ∈ E, then m1 is called the parent of m2 . Similarly, m2 is called the
child of m1 . If there is a directed path from m1 to m2 in E, then m2 is called
a descendant of m1 . Further, we say sequence q ∈ G(F, E) if q is compatible
with the partial order represented by G(F, E).
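Compatibility of a sequence with a partial order can be tested mechanically. A small Python sketch (our own naming; a naive transitive closure is adequate for small alphabets):

    def compatible(q, edges):
        """Is the sequence q compatible with the partial order whose reduced DAG
        has the given edge set?  Build the transitive closure, then check that no
        related pair appears out of order in q.  (Sketch only.)"""
        closure, grew = set(edges), True
        while grew:
            grew = False
            for (a, b) in list(closure):
                for (c, d) in list(closure):
                    if b == c and (a, d) not in closure:
                        closure.add((a, d))
                        grew = True
        pos = {x: k for k, x in enumerate(q)}
        return all(pos[x] < pos[y] for (x, y) in closure if x in pos and y in pos)

    print(compatible(['1', '4', '2', '3'], {('1', '2')}))    # True: 1 precedes 2
    print(compatible(['2', '1', '3', '4'], {('1', '2')}))    # False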

14.3.2 Partial order construction problem


Problem 27 Given m permutations qi, 1 ≤ i ≤ m, each defined on F, the
task is to construct the transitive reduction of a partial order
G(F, E),
satisfying the following. For each pair m1 , m2 ∈ F , m2 is a descendant of m1
if and only if m1 precedes m2 in all the given m permutations.
Does such a DAG always exist given some (nonempty) input permutations?
In fact, the solution to Problem 27 always exists and this transitive reduction
DAG
G(F, E)
is unique. Our interest is in constructing this transitive reduction of the
partial order DAG. The following property is crucial and is also central to the
algorithm design.

THEOREM 14.1
(Pair invariant theorem) Let
G(F, E)
be the solution to Problem 27. (m1 m2 ) ∈ E, if and only if for each qi ,

1. m1 precedes m2 and

2. the following set is empty:


 
m1 precedes m and
L(m1  m2 ) = m ∈ F .
m precedes m2 , in each qi

The proof is left as an exercise for the reader (Exercise 184). Note the
equivalence of the following sets:

    L(m1 ⪯ m2) = { m ∈ F | m1 precedes m and m precedes m2, in each qi }
               = { m ∈ F | m is a descendant of m1 and m2 is a descendant of m, in G(F, E) }.

Partial order construction algorithm. Theorem (14.1) is used to design


an incremental algorithm. Each qi is padded with a start character S ∉ F and an end character E ∉ F.
This ensures the two following properties:

1. The resulting DAG has exactly one connected component.

2. The only vertex with no incoming edges is S and the only vertex with
no outgoing edges is E.

The algorithm is described using a concrete example in Figure 14.3. The


DAG is initialized as a ‘chain’ representing q1 . The generic step for qi , 1 <
i ≤ m is defined as follows. At each iteration

Gi+1 (F, E i+1 )

is constructed from
Gi (F, E i )

[Figure: (a) Input: q1 = 1 2 3 4, q2 = 3 1 4 2, q3 = 4 1 3 2. (b) G1: the chain S → 1 → 2 → 3 → 4 → E. (c) G2. (d) G3.
FIGURE 14.3: Incremental construction of G(F, E) = G3.]

of the previous step. Finally,


Gm (F, E m ) = G(F, E)
is the required solution to the problem. At step i+1,
E i+1 = E ′ ∪ E ′′ ,
where E ′ and E ′′ are defined (constructed) as follows.
1. E ′ is the set of edges that ‘survive’ the new permutation qi+1 and is
defined (constructed) as follows:
E ′ = {(m1 m2 ) ∈ E i | m1 precedes m2 in qi+1 }.

2. E ′′ is the set of new edges that must be added to the DAG due to the
qi+1 and is defined (constructed) as follows. A pair of characters m1
and m2 are imm-compatible if all of the following hold:
(a) m1 precedes m2 in qi+1 with
L′ = {m | m1 precedes m and m precedes m2 in qi+1 }.
(b) m1 is an ancestor of m2 in Gi (F, E i ) with
L′′ = L(m1 ⪯ m2).
(c) L′ ∩ L′′ = ∅.
Then
E′′ = {(m1 m2) ∉ Ei | m1 and m2 are imm-compatible}.

The proof of correctness of the algorithm is left as an exercise for the reader
(Exercise 185).
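Since Theorem (14.1) characterizes the edges pairwise, the reduced DAG can also be built directly, without the incremental scheme. The following Python sketch (our own naming, no S/E padding) does exactly that, at a higher asymptotic cost, and reproduces the running example.

    def transitive_reduction(perms):
        """Edges of the reduced partial order of Problem 27, read off Theorem (14.1):
        (x, y) is an edge iff x precedes y in every permutation and no z lies
        strictly between x and y in every permutation.  (Direct, unoptimized sketch.)"""
        F = list(perms[0])
        pos = [{x: q.index(x) for x in q} for q in perms]
        everywhere = lambda x, y: all(p[x] < p[y] for p in pos)
        edges = set()
        for x in F:
            for y in F:
                if x != y and everywhere(x, y):
                    if not any(everywhere(x, z) and everywhere(z, y)
                               for z in F if z not in (x, y)):
                        edges.add((x, y))
        return edges

    q1, q2, q3 = list('1234'), list('3142'), list('4132')
    print(transitive_reduction([q1, q2, q3]))     # {('1', '2')}: the only common precedence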
What is the size of a reduced partial order graph, in the worst case? See
Exercise 158 of Chapter 13.

14.3.3 Excess in partial orders


Consider the concrete example discussed in the last section and reproduced
here in Figure 14.4. The reduced partial order graph has the smallest number
of edges that respect the order information in all the input permutations, yet
q4 , q5 , q6 ∈ G(F, E),
which are not part of the input collection of permutations. Thus it is clear
that the partial order captures some but not all of the subtle ordering infor-
mation in the data. Let I be the collection of input permutations then
Cex(B) ⊇ I.
The gap,
Cex(B) \ I,
is termed excess.

Handling excess with PQ structures. A heuristic using PQ trees to


reduce excess is discussed here.

1. The alphabet F is augmented with a new set of characters F ′ to obtain

F ∪ F ′.

Also
|F ′ | = O(|F |).

2. A complete extension q ′ on (F ∪ F ′ ) is converted to q on F by simply


removing all the
m′ ∈ F ′ .

This is best explained through an example. Let I be given as follows.

q1 = a b c d e g f,
q2 = b a c e d f g,
q3 = a c b d e f g,
q4 = e d g f a b c,
q5 = d e g f b a c

with
   F = {a, b, c, d, e, f, g}.
However, notice that certain blocks appear together as follows:

q1 = a b c [d e] [g f],
q2 = b a c [e d] [f g],
q3 = a c b [d e] [f g],
q4 = [e d] [g f] [a b c],
q5 = [d e] [g f] [b a c],

where the bracketed blocks mark the clusters that appear uninterrupted.

The reduced partial order graph for this input is shown in Figure 14.5(a).
In G′ , the alphabet is augmented with

F ′ = {S1, E1, S2, E2, S3, E3},

to obtain a ‘tighter’ description of I.

   [Figure omitted: (a) Input I: q1 = 1 2 3 4, q2 = 3 1 4 2, q3 = 4 1 3 2.
   (b) The reduced partial order DAG G. (c) Compatible q's: q4 = 4 1 2 3,
   q5 = 1 3 4 2, q6 = 1 4 2 3.]
FIGURE 14.4: Incremental construction of G(F, E) = G3.

   [Figure omitted: (a) Partial order G on {a, . . . , g}. (b) Augmented partial
   order G′, in which the clusters {a, b, c}, {d, e} and {f, g} are flanked by the
   meta symbol pairs (S1, E1), (S2, E2) and (S3, E3).]
FIGURE 14.5: G′ is a specialization of G: If q ∈ G′, then q ∈ G, but not
vice-versa.

The boxed elements in Figure 14.5(b) are clusters that always appear to-
gether, i.e., are uninterrupted in I. For example, if

q = a b d e f c g,

then clearly
q ∈ G,
but since the cluster {f, g} is interrupted by c and also the cluster {a, b, c} is
interrupted,
q 6∈ G′ .
These clusters are flanked by meta symbols, Si on the left and Ei on the
right, in the augmented partial order. Thus this scheme forces the elements
of the cluster to appear together, thus reducing excess.
We leave the details of this scheme as Exercise 188 for the reader.
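
To see the effect of the clusters in isolation, the following small sketch (the function name
and the set representation of clusters are assumptions made for illustration; it is not the
PQ tree machinery itself) checks whether a permutation keeps every cluster uninterrupted,
which is the extra condition that separates G′ from G in the example above.

```python
def respects_clusters(q, clusters):
    """True iff every cluster occupies a contiguous block of positions in q;
    this is what flanking a cluster with meta symbols Si/Ei enforces."""
    pos = {m: i for i, m in enumerate(q)}
    for cluster in clusters:
        idx = sorted(pos[m] for m in cluster)
        if idx[-1] - idx[0] != len(cluster) - 1:   # a gap interrupts the cluster
            return False
    return True

clusters = [set("abc"), set("de"), set("fg")]
# q = a b d e f c g interrupts {a,b,c} and {f,g}: it is in G but not in G'
print(respects_clusters(list("abdefcg"), clusters))   # False
# q1 = a b c d e g f keeps all clusters contiguous
print(respects_clusters(list("abcdegf"), clusters))   # True
```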

14.4 Statistics of Partial Orders


We need to compute the number of permutations that are compatible with a
given partial order B.
Figure 14.6 shows examples of some partial orders B, along with Prm(B)
and Cex(B).
An empty partial order is one whose DAG has only two vertices, representing
the start and the end symbol. The inverse partial order B̄ is obtained by

1. switching the start and end symbols and

2. reversing the direction of every edge in the partial order DAG.

Figure 14.7 shows an example of a partial order and its inverse. If a partial
order B is such that

   Cex(B) = Cex(B̄),

   [Figure omitted: top row, panels (a)-(c): partial orders B1, B2 and B3 on
   F = {1, 2, 3, 4}; bottom row, panels (d)-(f): the corresponding sets Cex(B1),
   Cex(B2) and Cex(B3), marked within the list of all 24 permutations of F
   (in panel (f) every permutation is marked compatible).]

FIGURE 14.6: Partial orders, each on elements F = {1, 2, 3, 4}, shown in
the top row. The bottom row shows all 24 permutations of F. The permuta-
tions compatible with the partial order are marked with √ and the rest are
marked with ×. Also, each permutation in the boxed array is the inverse of
the one to its left in the unboxed array.

then B is a degenerate partial order. The partial order B3 in Figure 14.6 is
degenerate. Note that

   Cex(B3) = Cex(B̄3) = Prm(B3).

The proof, that this is the only nonempty degenerate partial order, is straight-
forward and we leave it as an exercise for the reader (Exercise 189). In other
words, we can focus on just nondegenerate partial orders.
For a nondegenerate partial order B, what is the relationship between

   Cex(B) and Cex(B̄)?

   [Figure omitted: (a) Partial order B on {1, 2, 3, 4}. (b) Partial order B̄.]
FIGURE 14.7: A partial order B and its inverse partial order B̄.

   [Figure omitted: (a) a partial order B on {1, 2, 3, 4}; (b) its inverse B̄;
   (c) Cex(B) = {1234, 1243, 1324, 2134, 2143, 2314, 2341, 2413, 2431, 3124,
   3214, 3241}; (d) Cex(B̄), consisting of exactly the reverses of the permuta-
   tions in Cex(B).]

FIGURE 14.8: A partial order and its inverse is shown in the top row.
The next row shows all 24 permutations in Prm(·) for each. The elements
of Cex(·) are marked with √ and the rest are marked with ×. Also, each
permutation in the boxed array is the inverse of the one to its left in the
unboxed array.

   [Figure omitted: (a) a partial order B on {1, 2, 3, 4}; (b) its inverse B̄;
   (c) Cex(B) = {1234, 1243, 1324, 1342, 1423, 1432, 2134, 2143}; (d) Cex(B̄),
   consisting of exactly the reverses of these eight permutations.]

FIGURE 14.9: A partial order and its inverse is shown in the top row. The
next row shows all 24 permutations in Prm(·) for each. The elements of Cex(·)
are marked with √ and the rest are marked with ×. Also, each permutation
in the boxed array is the inverse of the one to its left in the unboxed array.
Notice that there are some permutations that belong to neither Cex(B) nor
Cex(B̄).

It is instructive to study the two examples shown in Figures 14.8 and 14.9.
We leave the proof of the following lemma as Exercise 190 for the reader.

LEMMA 14.3
Let B be a nondegenerate partial order.

1. If q ∈ Cex(B), then q ∉ Cex(B̄).

2. The converse of the last statement is not true, i.e., there may exist
   q ∈ Prm(B) such that

      q ∉ Cex(B) and q ∉ Cex(B̄).

3. If q ∈ Cex(B), then q̄ ∈ Cex(B̄), where q̄ denotes the reverse of q.

4.
      |Cex(B)| = |Cex(B̄)|,
      |Cex(B)| ≤ |Prm(B)| / 2.

LEMMA 14.4
(Symmetric lemma) For a nondegenerate partial order B, if the DAG of
B is isomorphic to the DAG of B̄, then

   |Cex(B)| = |Cex(B̄)| = n!/2.

The proof is left as an exercise for the reader (Exercise 191).

14.4.1 Computing Cex(B)


Given a partially ordered set B, how hard is it to count the number of
complete extensions? There are a few special cases where there is a sim-
ple algorithm, but in general it is hard. Specifically, it is #P-complete
[BW91a, BW91b], i.e., it is as hard as counting the number of satisfying
assignments of an instance of the satisfiability problem (SAT). However, it
can be approximated by using the polynomial time algorithm for computing
the the volume of either the order polytope or the chain polytope of B [DFK]
and the fact that the number of complete extensions of an n-element partial
order B is equal to n! times the volume of either of the polytopes.
A discussion on this method is beyond the scope of this book. Other re-
searchers have assumed restrictions, e.g., Mannila and Meek [MM00] restrict
their partial orders to have an MSVP (minimal vertex series-parallel) DAG
to get a handle on the problem. However, we give here an exposition on a
much simpler scheme to estimate the lower and upper bounds of the size of
Cex(B).
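
Before the bounds are developed, note that for small |F| the count can always be obtained
by plain enumeration over all |F|! permutations. The sketch below (the function name and
the edge-list input are illustrative assumptions) is such a brute-force baseline; it quickly
becomes infeasible, which is exactly why the coarse bounds that follow are useful.

```python
from itertools import permutations

def count_complete_extensions(F, edges):
    """|Cex(B)|: number of permutations of F honoring every edge constraint.
    Exhaustive, so only usable for small |F| (counting is #P-complete)."""
    count = 0
    for q in permutations(F):
        pos = {m: i for i, m in enumerate(q)}
        if all(pos[a] < pos[b] for a, b in edges):
            count += 1
    return count

# No constraints among {1,2,3,4} (the degenerate order B3): all 4! extensions
print(count_complete_extensions([1, 2, 3, 4], []))         # 24
# A single constraint 1 before 2 leaves exactly half of them
print(count_complete_extensions([1, 2, 3, 4], [(1, 2)]))   # 12
```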
Consider the DAG
G(V, E)

of a nondegenerate partial order B. Let the depth of a node v ∈ V , depth(v),


be the largest distance from the start symbol S, where the distance is in terms
of the number of nodes in the path from S to v (excluding S).
q ′ is defined to be a subsequence of q if

1. Π(q ′ ) ⊆ Π(q) and

2. if for any pair σ1 , σ2 ∈ Π(q ′ ), σ1 precedes σ2 in q ′ , then σ1 must precede


σ2 in q.

For example, given

q = a b c d e,
q1 = a c e,
q2 = c a,

q1 is a subsequence of q but q2 is not.


Let q1 and q2 be such that

Π(q1 ) ∩ Π(q2 ) = ∅.

Then
q = q1 ⊕ q2 ,

is defined as follows:

1. Π(q) = Π(q1 ) ∪ Π(q2 ) and

2. q1 and q2 are subsequences of q.

Columns of the grid. For each v, assign

   col(v) = depth(v).

For each distinct depth i,

Ci = {v | col(v) = i}.

The depth(v) of each v can be computed in linear time using a breadth first
traversal (BFS) of the DAG (see Chapter 2).

   [Figure omitted: two grid assignments, (1) and (2), of the nodes of the
   same DAG between S and E.]
FIGURE 14.10: Two possible grid assignments of nodes of a partial order
DAG. The C's are the same but the R's differ in the two assignments.

Rows of the grid. Let


v1 v2 v3 . . . vl
be the vertices along a path on the DAG such that

col(v1 ) < col(v2 ) < . . . < col(vl ).

Then
row(v1 ) = row(v2 ) = . . . = row(vl ).
A depth first traversal (DFS) of the DAG (see Chapter 2) can be used to
compute row(v) for each v satisfying these constraints.
Let
Ri = {v | row(v) = i}.
Let the number of nonempty C’s be c and let the number of nonempty R’s
be r. We use the following convention:

ni = |Ri |, for 1 ≤ i ≤ r.

At the end of the process row(v) and col(v) have been computed for each
v. It is possible to obtain different values of row(v) satisfying the condition.
But that does not matter. We are looking for a small number of R sets with
as large a size (|R|) as possible. However, this is only a heuristic to simply put
the vertices on a virtual grid (i, j). See Figure 14.10 for a concrete example.
Let
col(B) = {q = q1 q2 . . . qc | Π(qi ) = Ci , for 1 ≤ i ≤ c}.
The following observation is crucial to the scheme:

If q ∈ col(B), then q ∈ B.

Note that col(B) does miss a few extensions of B, since the vertices of each
column are in strict proximity. Thus

col(B) ⊆ Cex(B). (14.1)



Also, the size of col(B) is computed exactly as follows:

   |col(B)| = ∏_{i=1}^{c} |Ci|! .

Let

   qi = v1 v2 . . . vni , for 1 ≤ i ≤ r,

where v1, v2, . . . , vni are the vertices of Ri taken in increasing column order,
and let

   row(B) = q1 ⊕ q2 ⊕ . . . ⊕ qr .
Again, the following observation is crucial to the scheme:
If q ∈ B, then q ∈ row(B).

Note that each q ∈ B must also belong to row(B), since no order of the
elements is violated. However, some q ∈ row(B), may violate the order, since
there are some edges that go across the R rows, which is not captured by the
row(B) definition. Thus
Cex(B) ⊆ row(B). (14.2)
Also, the size of row(B) is computed exactly as follows (see Exercise 187 for
details of this computation):

   |row(B)| = (n1+n2 choose n2) (n1+n2+n3 choose n3) · · · (n1+n2+..+nr choose nr).
In conclusion,

   col(B) ⊆ Cex(B) ⊆ row(B).

For a nondegenerate partial order B,

   |col(B)| ≤ |Cex(B)| ≤ min( |row(B)|, |V|!/2 ),
where V is the set of vertices in the DAG of B. Thus the sizes of row(B)
and col(B) can be used as coarse lower and upper bounds of |Cex(B)|. It is
possible to refine the bounds by trying out different assignments of row(v)
(note that for a v, col(v) is unique).
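
Both bounds are simple to compute from the DAG. The sketch below is one possible
realization under stated assumptions: col(v) is the longest distance from a source
(computed by repeated relaxation rather than an explicit BFS), and the rows come from
a greedy path cover, which is only one of the many legal row assignments discussed
above, so the resulting |row(B)| is itself a heuristic value.

```python
from math import comb, factorial

def cex_bounds(F, edges):
    """Coarse bounds |col(B)| <= |Cex(B)| <= |row(B)| from the grid scheme."""
    pred = {v: [] for v in F}
    for a, b in edges:
        pred[b].append(a)

    # col(v) = length of the longest path ending at v (DAG, so |F| rounds suffice)
    col = {v: 0 for v in F}
    for _ in range(len(F)):
        for a, b in edges:
            col[b] = max(col[b], col[a] + 1)

    # |col(B)| = product over columns Ci of |Ci|!
    layers = {}
    for v in F:
        layers.setdefault(col[v], []).append(v)
    col_B = 1
    for layer in layers.values():
        col_B *= factorial(len(layer))

    # rows: scan vertices by column; extend the row of a predecessor that is
    # currently the last vertex of its row, otherwise open a new row
    row, rows = {}, []
    for v in sorted(F, key=lambda u: col[u]):
        free = [row[p] for p in pred[v] if rows[row[p]][-1] == p]
        if free:
            row[v] = free[0]
            rows[free[0]].append(v)
        else:
            row[v] = len(rows)
            rows.append([v])

    # |row(B)| = (n1+n2 choose n2)(n1+n2+n3 choose n3)...(n1+..+nr choose nr)
    row_B, total = 1, 0
    for r in rows:
        total += len(r)
        row_B *= comb(total, len(r))
    return col_B, row_B

# Single constraint 1 before 2 on {1,2,3,4}: bounds 6 <= |Cex(B)| = 12 <= 12
print(cex_bounds([1, 2, 3, 4], [(1, 2)]))   # (6, 12)
```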

Back to probability computation. We pose the following question:

   What is pr(B), the probability of a permutation of all elements of F
   being compatible with a given partial order B defined on F?

The total number of permutations of the elements of F is |F|!. Then, the
probability is given as

   pr(B) = |Cex(B)| / |F|! .

An alternative view. Let B be a partial order on F. We label the nodes
of the partial order by the integers

   1, 2, . . . , |F|

in any order. Let q be a random permutation of the integers

   1, 2, . . . , |F|.

See Section (5.2.3) for a definition of a random permutation. Then the proba-
bility, pr(B), of the occurrence of the event

   q ∈ Cex(B)

is given by

   pr(B) = |Cex(B)| / |F|! .
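
This view also suggests a simple Monte Carlo estimate of pr(B) when |Cex(B)| is out of
reach: draw uniformly random permutations and record the fraction that are compatible
with B. A minimal sketch (the function name, edge-list input and sample size are
illustrative choices):

```python
import random

def estimate_pr(F, edges, trials=100_000):
    """Monte Carlo estimate of pr(B): the fraction of uniformly random
    permutations of F that are compatible with the partial order B."""
    hits = 0
    for _ in range(trials):
        q = list(F)
        random.shuffle(q)                 # uniform random permutation
        pos = {m: i for i, m in enumerate(q)}
        hits += all(pos[a] < pos[b] for a, b in edges)
    return hits / trials

# With the single constraint 1 before 2 on F = {1,2,3,4}, pr(B) = 12/24 = 0.5
print(estimate_pr([1, 2, 3, 4], [(1, 2)]))   # approximately 0.5
```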

14.5 Redescriptions
We have already developed the vocabulary to appreciate a very interesting
idea called redescriptions, introduced by Naren Ramakrishnan and coauthors
in [RKM+ 04]. This can be simply understood as follows. For a given incidence
matrix I, if distinct expressions

e1 ≠ e2

are such that


O(e1 ) = O(e2 ),
then e1 is called a redescription of e2 and vice-versa. In other words, given a
data set I, e1 and e2 provide alternative descriptions for some sample set (or
set of rows in I) denoted by O(e1 ). Usually if

Π(e1 ) ∩ Π(e2 ) = ∅,

then the implications are even stronger since the alternative description or
explanation is over a different set of features (or columns).
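
A toy version of the definition is easy to sketch for conjunctive expressions on a 0/1
incidence matrix: group expressions by their support O(·) and report groups with more
than one member; members drawn from disjoint feature sets are the interesting
redescriptions. This is only an illustration, not the redescription mining algorithm of
[RKM+ 04]; the function below and the restriction to conjunctions of at most two
features are assumptions.

```python
from itertools import combinations

def conjunction_redescriptions(I, columns, max_size=2):
    """Group simple conjunctive expressions by their support O(e) in the
    0/1 incidence matrix I; expressions sharing a support describe (and
    hence redescribe) the same set of rows."""
    by_support = {}
    for k in range(1, max_size + 1):
        for cols in combinations(range(len(columns)), k):
            support = frozenset(r for r, row in enumerate(I)
                                if all(row[c] for c in cols))
            expr = " ".join(columns[c] for c in cols)      # conjunction
            by_support.setdefault(support, []).append(expr)
    return {rows: exprs for rows, exprs in by_support.items() if len(exprs) > 1}

# A small 0/1 matrix over features m1..m5; e.g. m1 and m4 have the same support
I = [[1, 0, 1, 1, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 1, 0]]
print(conjunction_redescriptions(I, ["m1", "m2", "m3", "m4", "m5"]))
```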
A redescription is hence a shift-of-vocabulary; the goal of redescription min-
ing is to find segments of the input that afford dual definitions and to find
those definitions. For example, redescription may suggest alternative path-
ways of signal transduction that might target the same set of genes. However,
the underlying premise is that input sequences can indeed be characterized
in at least two ways using some definition (say boolean expression or partial
orders or partial orders of expressions). Thus for instance in cis-regulatory
regions, the existence of redescriptions signifies that possibly these must be under


concerted combinatorial control (and likely to lie upstream of functionally
related genes).
This is only a brief introduction to an exciting idea and the reader is directed
to [RKM+ 04, PR05] for further reading on this topic.

14.6 Application: Partial Order of Expressions


We discuss here a possible application of the detection of partial order on
expressions. Such a complex mechanism is termed combinatorial control in
the following discussion [RP07].
Combinatorial control [RSW04]—the use of a small number of transcrip-
tion factors in different combinations to realize a range of gene expression
patterns—is a key mechanism of transcription initiation in eukaryotes. Many
important processes, such as development [LD05], stress response, and neu-
ronal activity, universally rely on combinatorial control to accomplish a di-
versity of cellular functions.
How does a given set of transcription factors determine which genes to acti-
vate or repress? Since genomic DNA is tightly packaged with proteins inside
the nucleus, initiation of transcription is an elaborate process involving chro-
matin remodeling around the gene of interest, recruitment of transcription
activators, promoter-specific binding and assembly of the transcription appa-
ratus, culminating in formation of protein-DNA complexes. Furthermore, for
a given regulatory module, varying degrees of occupancy by one or more tran-
scription factors, specificity of binding, co-operativity among activators and
repressors, and the resulting stability of the assembled complex all contribute
to the richness of gene expression observed in the cell.
Understanding how multiple signals are combinatorially integrated in a
given regulatory module is thus a complex problem and can be broken down
into meaningful subproblems [LW03]. First, what are the binding sites (cis-
regulatory regions) influencing gene expression? This question has been stud-
ied by elucidating structures of regulatory protein-DNA complexes, promoter
prediction algorithms [PBCB99], as well as by comparative genomic sequence
analysis [KPE+ 03].
Second, which transcription factors occupy these sites? Genome-wide loca-
tion analysis techniques such as ChIP-chip [SSDZ05] are relevant here.
Third, how can transcription factor occupancy be related to gene expres-
sion? Many researchers have exploited the computational nature of this
problem and proposed predictive models ranging from simple boolean gates
[BGH03] to complex circuits [ID05]. Others have adopted a descriptive ap-
proach, and identified combinations of promoter motifs that characterize or
explain observed co-expression of genes [SS05, PSC01].

Finally, what is the cellular context or range of environmental conditions


that actually trigger transcription? The condition-specific use of transcription
factors is studied in, e.g., Segal et al. [SSR+ 03] and Harbison et al. [Hea04].

An emerging trend is to model the entire sequence and structural context


in which transcription occurs, in particular capturing the physical layout and
arrangement of cis-regulatory regions and how variations in their ordering
and spacing affect the specificity of the set of genes transcribed. Some tran-
scription factors influencing combinatorial control are not sensitive to spacing
between binding sites whereas others can be influenced by changes of even one
nucleotide in the spacer region. At a higher level [Chi05], overlapping sites
might prevent simultaneous binding whereas far away sites may require DNA
looping to initiate transcription.

Furthermore, since the transcription machinery is a multi-protein complex


with structural constraints, accessibility to binding sites is crucial and even
a slight permutation of the cis-regulatory regions can prevent binding. A
striking example is given in [Chi05], of yeast sulfur metabolism, where one
permutation of binding sites (Cbf1, Met31) induces expression of the HIS3
reporter gene, whereas the reverse permutation (Met31, Cbf1) does not.

Hence, although cis-regulatory regions are very short (≈ 5-15 base pairs),
they can co-occur in symbiotic, compensatory, or antagonistic combinations.
Hence, characterizing permutation and spacing constraints underlying a fam-
ily of transcription factors can possibly help in understanding how genes are
selectively targeted for expression in a given cell state.

14.7 Summary

This chapter discusses a very general, almost descriptive characterization of


sequences in terms of permutations and partial orders on (boolean) expressions
on motifs. This attempts to answer complex questions: For instance, given
a set of sequences, can the order constraints, say in their upstream regions,
be characterized? Also, can the concerted clusters of sequences (genes) be
identified that exhibit distinctive order constraints?

14.8 Exercises
Exercise 175 (Boolean expression) Consider the following incidence ma-
trix:

         m1 m2 m3 m4 m5
          1  0  1  1  0
   I =    1  1  0  1  0
          0  0  1  0  0
          1  1  0  1  0
1. Construct boolean expressions e1, e2 and e3 on the five motifs where
   each satisfies the corresponding equation below.

      O(e1) = {1, 3},
      O(e2) = {1, 3, 4},
      O(e3) = {1, 2, 3, 4}.

2. Show that for any subset, Z, of {1, 2, 3, 4}, there exists e such that

O(e) = Z.

3. Enumerate all the distinct expressions on the five variables

m1 , m2 , . . . , m5

defined by I.
Hint: 2. For each row i construct an expression ei and then e is constructed
from this collection of ei ’s.
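
When working through this exercise it helps to be able to evaluate the support O(e) of a
candidate expression directly on I. The helper below is one possible way to do that
(representing an expression as a Python predicate over a row is an assumption made
purely for convenience).

```python
def support(I, e):
    """O(e): the rows of the 0/1 incidence matrix I (numbered from 1)
    on which the boolean expression e evaluates to 1."""
    return {r + 1 for r, row in enumerate(I) if e(row)}

I = [[1, 0, 1, 1, 0],
     [1, 1, 0, 1, 0],
     [0, 0, 1, 0, 0],
     [1, 1, 0, 1, 0]]

# Columns are m1..m5 at indices 0..4; e.g. the expression "m3" alone:
print(support(I, lambda row: row[2] == 1))            # {1, 3}
# A conjunction m1 m4 and a disjunction m2 + m3:
print(support(I, lambda row: row[0] and row[3]))      # {1, 2, 4}
print(support(I, lambda row: row[1] or row[2]))       # {1, 2, 3, 4}
```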

Exercise 176 (Boolean variables)


1. For boolean variables m1 and m2 show that

      ¬(m1 + m2) = (¬m1)(¬m2),
      ¬(m1 m2) = ¬m1 + ¬m2.

2. Show that any expression e can be written as a CNF.


3. Show that any expression e can be written as a DNF.
Hint: 1. These are De Morgan's laws. Use the definition of negation of
expressions to prove this. 2. & 3. Use De Morgan’s laws.

Exercise 177 (Monotone expression) Let e be a monotone expression.

1. Show that if the value of any variable in Π(e) is changed from 0 to 1,


then the value of e only ‘increases’, i.e.,

(a) either from 0 to 1 or remains at 0 or remains at 1,


(b) but never from 1 to 0.

2. Similarly, show that if the value of any variable in Π(e) is changed from
1 to 0, then the value of e only ‘decreases’, i.e.,

(a) either from 1 to 0 or remains at 0 or remains at 1,


(b) but never from 0 to 1.

This is called the monotone behavior of e.

Hint: 1. & 2. A monotone expression has only conjunctions and disjunctions.


Consider conjunctions and then disjunctions.
Note that the notion of monotone boolean expression is analogous to that of
a monotonically increasing function defined on R.

Exercise 178 (Flip lemma) For a given incidence matrix I, let a polymor-
phic S be defined as follows:

S[1] = {i | i is a row in I},


S[2] = {f | f is a column in I}.

Next, given a polymorphic S, let S̄ be defined as follows:

   S̄[1] = {i | i ∉ S[1]},
   S̄[2] = {f | f ∈ S[2]}.

If
e1 = m1 m2 . . . ml
and the pair
S[1] = Π(e1 ) and S[2] = O(e1 )
is a minimal (maximal resp.) bicluster in I then the pair

S̄[1] = Π(e2) and S̄[2] = O(e2)

is a minimal (maximal resp.) bicluster in I where

e2 = m1 + m2 + . . . + ml .

Hint: See the example in Figure 14.2.

Exercise 179 (Minimal bicluster) Consider I of Figure 14.2.

         m1 m2 m3 m4 m5 m6 m7
          0  1  1  1  1  1  1
          1  0  0  1  0  0  0
          0  1  1  1  0  1  1
   I =    1  0  0  1  0  1  1
          0  1  1  1  1  1  0
          0  1  0  0  0  0  1
          0  1  0  1  1  1  0
          1  0  0  0  0  0  1

1. Let V = {m2 , m4 , m6 }. Then, discuss why each of the following biclus-


ters is not minimal.
(a) V and O1 = {1, 3, 5},
(b) V and O2 = {3, 5, 7},
(c) V and O3 = {1, 3, 7},
(d) V and O4 = {1, 5, 7},
2. Why is the bicluster V , O5 = {3, 7} minimal?
Hint: 1. For each Oi , is the corresponding V set the same? 2. Fixing O5 ,
what is the corresponding set of variables? Is it the same as V ?

Exercise 180 (Set intersections to biclusters) Discuss how the problem


of detecting biclusters from a discrete matrix M , given a quorum K, can be
solved using the maximal set intersection problem of Chapter 13.

Exercise 181 For a partial order B defined on F with |F| > 1, show that

   |Prm(F)|

is an even number.
Hint: Note that |Prm(F)| = |F|!.
Yet another argument is by noticing that for |F| > 1, a permutation q ≠ q̄ (its reverse).

Exercise 182 (On transitive reduction) Show that the following two state-
ments are equivalent.
Let B be a partial order defined on F .

1. Let G(F, E) be the graph representation (DAG) of B, i.e., if

      m1 ≺ m2

   in B, then there is a path from m1 to m2 in G(F, E). If E is the smallest
   set of edges satisfying this condition then G(F, E) is called a transitive
   reduction of B.

2. If G(F, E) is the transitive reduction of B, then for any pair,

m1 , m2 ∈ F,

if there is a directed path of length larger than 1 from m1 to m2 in
G(F, E), then the edge

   (m1 m2) ∉ E.

Hint: Use proof by contradiction.

Exercise 183 (Existence and uniqueness of partial order DAG) Prove


that the solution to Problem 27 always exists and the constructed DAG is
unique.

Hint: For any pair m1 , m2 ∈ F , an edge is introduced between them if m1


precedes m2 in all the input sequences. Next, construct the unique transitive
reduction.

Exercise 184 (Edge in partial order DAG) Let

G(F, E)

be the solution to Problem 27. Show that

(m1 m2 ) ∈ E,

if and only if for each qi , the following two conditions hold:

1.
      { m ∈ F | m1 precedes m and m precedes m2, in each qi } = ∅,

   and

2. m1 precedes m2 .

Hint: Use proof by contradiction.

Exercise 185 (Partial order algorithm) Consider the algorithm described


in Section (14.3.2) to construct the partial reduction of a DAG given a set of
sequences.

1. Identify the edge sets E ′ and E ′′ at each step in the example shown in
Figure 14.3.

2. Prove that the algorithm is correct.

3. What is the time complexity of the algorithm?

Hint: 2. Use Theorem (14.1).

Exercise 186 (Complete extensions) Enumerate the sets Cex(B) and
Cex(B̄) for the partial orders shown below.

   [Figure omitted: a partial order B on {1, 2, 3, 4} and its inverse B̄
   (cf. Figure 14.7).]

Exercise 187 (Oranges & apples problem) In the following, qi ’s are


permutations such that

Π(qi ) ∩ Π(qj ) = ∅, for i 6= j,

and ni = |qi |, for each i.

1. Let
      S2 = {q | q = q1 ⊕ q2}.
   Show that
      |S2| = (n1+n2 choose n1) = (n1+n2 choose n2).

2. Let
      Sr = {q | q = q1 ⊕ q2 ⊕ . . . ⊕ qr}.
   Show that
      |Sr| = (n1+n2 choose n2) (n1+n2+n3 choose n3) · · · (n1+n2+..+nr choose nr).

Hint: 1. This is the problem of arranging n1 oranges and n2 apples along


a line. Then out of the n1 + n2 slots, in how many ways can n1 positions
be picked for the oranges? The rest of the slots will be filled by the apples.
2. Now the orange is further categorized as a pumelo or a tangerine or a
mandarin and so on. The arrangement of these varieties in the orange slots
were determined in the previous step.

Exercise 188 (Excess)


1. Discuss how (recursive) PQ structures may be used for reducing excess
in a partial order DAG.
2. What are the issues involved in using multiple DAGs to represent some
   m permutations?
Hint: 1. Use the following PQ tree for the example of Figure 14.4.

   [Figure omitted: a PQ tree on {a, b, c, d, e, f, g} whose internal nodes
   group the clusters {a, b, c}, {d, e} and {f, g}, flanked by the meta symbol
   pairs (S1, E1), (S2, E2) and (S3, E3).]
2. How many distinct DAGs? m? How does excess reduce with the increase
in number of DAGs? What criterion to use?

Exercise 189 (Unique degenerate) Show that the only nonempty partial
order that is degenerate (Cex(B) = Cex(B̄)) is of the following form:

   [Figure omitted: the DAG with an edge from S to each of 1, 2, 3, 4 and
   from each of 1, 2, 3, 4 to E, i.e., no order constraints among the elements
   themselves.]

Exercise 190 (Complete extensions) Let B be a nondegenerate partial
order defined on F. Then show the following.

1. If q ∈ Cex(B), then q ∉ Cex(B̄).

2. The converse of the last statement is not true, i.e., there may exist
   q ∈ Prm(F) such that q ∉ Cex(B) and q ∉ Cex(B̄).

3. If q ∈ Cex(B), then q̄ ∈ Cex(B̄).

4.
      |Cex(B)| = |Cex(B̄)|,
      |Cex(B)| ≤ |Prm(F)| / 2.

Hint: 1. & 3. Since B is nondegenerate there is at least one ordering of a
pair of elements v1 and v2 in the DAG. If q ∈ Cex(B), then this ordering
is honored in q, and thus q cannot belong to Cex(B̄). 2. Pick an example from
Figure 14.9 to demonstrate this fact. 4. Follows from 3.

Exercise 191 (Isomorphic partial order)

1. If the DAG of a partial order B is isomorphic to the DAG of a partial
   order B′, then show that

      |Cex(B)| = |Cex(B′)|.

2. For a nondegenerate partial order B, if the DAG of B is isomorphic to
   the DAG of B̄, then show that

      |Cex(B)| = |Cex(B̄)| = n!/2.

Hint: 1. The nodes can be simply relabeled to make the DAGs identical. 2.
Follows from 1.

Exercise 192 (Complete extensions) Let B be a nondegenerate partial
order.

1. Under what conditions does the following hold?

      |Cex(B)| = ∏_{v ≠ E} outDeg(v, B)! .

2. Is the following true? Why?

      |Cex(B)| ≤ min( ∏_{v ≠ E} outDeg(v, B)!, ∏_{v ≠ S} inDeg(v, B)!, n!/2 ).
References

[AALS03] A. Amir, A. Apostolico, G. M. Landau, and G. Satta. Efficient


text fingerprinting via Parikh mapping. Journal of Discrete Al-
gorithms, 1(5-6):409 – 421, 2003.
[ACP03] M. Alexandersson, S. Cawley, and L. Pachter. SLAM—cross-
species gene finding and alignment with a generalized pair hid-
den Markov model. 13:496–502, 2003.
[ACP05] A. Apostolico, M. Comin, and L. Parida. Conservative extrac-
tion of over-represented extensible motifs. ISMB (Supplement
of Bioinformatics), 21:9–18, 2005.
[ACP07] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Varun:
Discovering extensible motifs under saturation constraints.
Preprint, 2007.
[AGK+ 04] W. Ao, J. Gaudet, W.J. Kent, S. Muttumu, and S.E Mango.
Environmentally induced foregut remodeling by pha-4/foxa and
daf-12/nhr. Science, 305(5691):1743–1746, 2004.
[AIL+ 88] A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and
U. Vishkin. Parallel construction of a suffix tree with applica-
tions. Algorithmica, 3:347–365, 1988.
[AMS97] T.S. Anantharaman, B. Mishra, and D.C. Schwartz. Genomics
via optical mapping II: Ordered restriction maps. Journal of
Computational Biology, 4(2):91–118, 1997.
[AP04] A. Apostolico and L. Parida. Incremental paradigms for motif
discovery. Journal of Computational Biology, 11(4):15–25, 2004.
[B0̈4] S. Böcker. Sequencing from compomers: Using mass spectrom-
etry for DNA de novo sequencing of 200+ nt. Journal of Com-
putational Biology, 11(6):1110–1134, 2004.
[BBE+ 85] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, M.T. Chen,
and J. Seiferas. The smallest automaton recognizing the sub-
words of a text. Theoretical Computer Science, pages 31–55,
1985.
[BCL+ 01] John W.S. Brown, Gillian P. Clark, David J. Leader, Craig G.
Simpson, and Todd Lowe. Multiple snoRNA gene clusters from
arabidopsis. RNA, 7:1817–1832, 2001.


[BE94] T. L. Bailey and C. Elkan. Fitting a mixture model by ex-


pectation maximization to discover motifs in biopolymers. In
Proceedings of the Second International Conference on Intelli-
gent Systems for Molecular Biology, pages 28–36. AAAI Press,
1994.

[BGH03] N.E. Buchler, U. Gerland, and T. Hwa. On Schemes of Combi-


natorial Transcription Logic. PNAS, 100(9):5136–5141, 2003.

[BHM87] S. K. Bryan, M. E. Hagensee, and R. E. Moses. DNA poly-


merase III requirement for repair of DNA damage caused by
methyl methanesulfonate and hydrogen peroxide. In Journal of
Bacteriology, volume 16, pages 4608–4613. ACM Press, 1987.

[BL76] K. Booth and G. Lueker. Testing for the consecutive ones prop-
erty, interval graphs, and graph planarity using PQ-tree algo-
rithms. Journal of Computer and System Sciences, 13:335–379,
1976.

[BLZ93] B. Balasubramanian, C. V. Lowry, and R. S. Zitomer. The


Rox1 repressor of the saccharomyces cerevisiae hypoxic genes
is a specific DNA-binding protein with a high-mobility-group
motif. Mol Cell Biol, 13(10):6071–6178, 1993.

[BMRS02] M. -P. Bal, F. Mignosi, A. Restivo, and M. Sciortino. Forbidden


words in symbolic dynamics. Advances in Applied Mathematics,
25:163–193, 2002.

[BMRY04] K.H. Burns, M.M. Matzuk, A. Roy, and W. Yan. Tektin3


encodes an evolutionarily conserved putative testicular micro
tubules-related protein expressed preferentially in male germ
cells. In Molecular Reproduction and Development, volume 67,
pages 295–302. ACM Press, 2004.

[Bsh95] N.H. Bshouty. Exact Learning Boolean Functions via the Mono-
tone Theory. Information and Computation, Vol. 123(1):146–
153, 1995.

[BT02] Jeremy Buhler and Martin Tompa. Finding motifs using random
projections. In Journal of Computational Biology, volume 9(2),
pages 225—242, 2002.

[BW91a] G. Brightwell and P. Winkler. Counting linear extensions. In


Order, pages 225–242, 1991.

[BW91b] G. Brightwell and P. Winkler. Counting linear extensions is


#P-complete. In Proc. 23rd ACM Symposium on the Theory of
Computing (STOC), pages 175–181, 1991.

[Chi05] D. Y.-H. Chiang. Computational and Experimental Analyses


of Promoter Architecture in Yeasts. PhD thesis, University of
California, Berkeley, Spring 2005.
[CJI+ 98] W. Cai, J. Jing, B. Irvine, L. Ohler, E. Rose, H. Shizua,
U. J. Kim, M. Simon, T. Anantharaman, B. Mishra, and
D. C. Schwartz. High-resolution restriction maps of bacte-
rial artificial chromosomes constructed by optical mapping.
Proc. Natl. Acad. Sci. USA, 95:3390–3395, 1998.
[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to
Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.
[CMHK05] Joseph F. Contrera, Philip MacLaughlin, Lowell H. Hall, and
Lemont B. Kier. QSAR modeling of carcinogenic risk using dis-
criminant analysis and topological molecular descriptors. Cur-
rent Drug Discovery Technologies, Vol. 2(2):55–67, 2005.
[CP04] A. Chattaraj and L. Parida. An inexact suffix tree based al-
gorithm for extensible pattern discovery. Theoretical Computer
Science, (1):3–14, 2004.
[CP07] Matteo Comin and Laxmi Parida. Subtle motif discovery for
detection of DNA regulatory sites. Asia Pacific Bioinformatics
Conference, pages 95–104, 2007.
[CR04] F. Coulon and M. Raffinot. Fast algorithms for identifying max-
imal common connected sets of interval graphs. Proceedings
of Algorithms and Computational Methods for Biochemical and
Evolutionary Networks (CompBioNets), 2004.
[DBNW03] O. Dror, H. Benyamini, R. Nussinov, and H. J. Wolfson. Mul-
tiple structural alignment by secondary structures: Algorithm
and applications. Protein Science, 12(11):2492–2507, 2003.
[DFK] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time
algorithm for approximating the volume of convex bodies. In
Journal of the ACM (JACM), number 1.
[DH73] R.O. Duda and P. E. Hart. Pattern Classification and Scene
Analysis. John Wiley & Sons, Menlo Park, California, 1973.
[Did03] G. Didier. Common intervals of two sequences. In Proc. of the
Third Wrkshp. on Algorithms in Bioinformatics, volume 2812 of
Lecture Notes in Bioinformatics, pages 17–24. Springer-Verlag,
2003.
[DS03] Dannie Durand and David Sankoff. Tests for gene clustering.
Journal of Computational Biology, 10(3-4):453–482, 2003.

[DSHB98] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation


of gene order: a fingerprint of proteins that physically interact.
Trends Biochem. Sci., 23:324–328, 1998.
[ELP03] Revital Eres, Gad M. Landau, and Laxmi Parida. A combina-
torial approach to automatic discovery of cluster-patterns. In
Proc. of WABI, September 15-20, 2003.
[EP02] Eleazar Eskin and Pavel Pevzner. Finding composite regulatory
patterns in DNA sequences. In Bioinformatics, volume 18, pages
354–363, 2002.
[Fel68] William Feller. An Introduction to Probability Theory and its
Applications. Wiley, 1968.
[Fis87] D.H. Fisher. Knowledge Acquisition via Incremental Concep-
tual Clustering. Machine Learning, Vol. 2(2):139–172, 1987.
[FRP+ 99] Aris Floratos, Isidore Rigoutsos, Laxmi Parida, Gustavo
Stolovitzky, and Yuan Gao. Sequence homology detection
through large-scale pattern discovery. In Proceedings of the
Annual Conference on Computational Molecular Biology (RE-
COMB99), pages 209–215. ACM Press, 1999.
[GBM+ 01] S. Giglio, K. W. Broman, N. Matsumoto, V. Calvari, G. Gimelli,
T. Neuman, H. Obashi, L. Voullaire, D. Larizza, R. Giorda, J. L.
Weber, D. H. Ledbetter, and O. Zuffardi. Olfactory receptor-
gene clusters, genomic-inversion polymorphisms, and common
chromosome rearrangements. Am. J. Hum. Genet., 68(4):874–
883, 2001.
[GJ79] M.R. Garey and D.S. Johnson. Computers and Intractability: A
Guide to the Theory of NP-Completeness. W.H. Freeman and
Co., San Francisco, 1979.
[GPS03] H. H. Gan, S. Pasquali, and T. Schlick. Exploring the repertoire
of RNA secondary motifs using graph theory: implications for
RNA design. Nucleic Acids Research, 31(11):2926–2943, 2003.
[Hea04] C.T. Harbison et al. Transcriptional Regulatory Code of a Eu-
karyotic Genome. Nature, 431:99–104, Sep 2004.
[HETC00] J. D. Hughes, P. W. Estep, S. Tavazoie, and G. M. Church.
Computational identification of cis-regulatory elements associ-
ated with groups of functionally related genes in Saccharomyces
cerevisiae. J Molec. Bio., 296:1205–1214, 2000.
[HG04] X. He and M.H. Goldwasser. Identifying conserved gene clus-
ters in the presence of orthologous groups. In Proceedings of the
Annual Conference on Computational Molecular Biology (RE-
COMB04), pages 272–280. ACM Press, 2004.

[HS99] G. Z. Hertz and G. D. Stormo. Identifying DNA and protein


patterns with statistically significant alignments of multiple se-
quences. Bioinformatics, 15:563–577, 1999.
[HSD05] Rose Hoberman, David Sankoff, and Dannie Durand. The sta-
tistical significance of max-gap clusters. Lecture Notes in Com-
puter Science, 3388/2005, 2005.
[ID05] S. Istrail and E.H. Davidson. Logic functions of the genomic
cis-regulatory code. PNAS, 102(14):4954–4959, Apr 2005.
[IWM03] A. Inokuchi, T. Washio, and H. Motoda. Complete mining of
frequent patterns from graphs: Mining graph data. Machine
Learning, 50(3):321–354, 2003.
[KK00] D. Kihara and M. Kanehisa. Tandem clusters of membrane pro-
teins in complete genome sequences. Genome Research, 10:731–
743, 2000.
[KLP96] Z. M. Kedem, G. M. Landau, and K. V. Palem. Parallel suffix-
prefix matching algorithm and application. SIAM Journal of
Computing, 25(5):998–1023, 1996.
[KMR72] R. Karp, R. Miller, and A. Rosenberg. Rapid identification of
repeated patterns in strings, arrays and trees. In Symposium on
Theory of Computing, volume 4, pages 125–136, 1972.
[KP02a] Keich and Pevzner. Finding motifs in the twilight zone. In
Annual International Conference on Computational Molecular
Biology, pages 195–204, Apr, 2002.
[KP02b] Uri Keich and Pavel Pevzner. Subtle motifs: defining the limits
of motif finding algorithms. In Bioinformatics, volume 18, pages
1382–1390, 2002.
[KPE+ 03] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. S. Lan-
der. Sequencing and comparison of yeast species to identify
genes and regulatory elements. Nature, 423:241–254, May 2003.
[KPL06] Md Enamul Karim, Laxmi Parida, and Arun Lakhotia. Using
permutation patterns for content-based phylogeny. In Pattern
Recognition in Bioinformatics, volume 4146 of Lecture Notes in
Bioinformatics, pages 115–125. Springer-Verlag, 2006.
[LAB+ 93] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F.
Neuwald, and J. C. Wootton. Detecting subtle sequence signals:
A Gibbs sampling strategy for multiple alignment. Science,
262:208–214, Oct, 1993.
[LD05] M. Levine and E.H. Davidson. Gene regulatory networks for
development. PNAS, 102(4):4936–4942, Apr 2005.

[LKWP05] Arun Lakhotia, Md Enamul Karim, Andrew Walenstein, and


Laxmi Parida. Malware phylogeny using maximal π-patterns.
In EICAR, 2005.
[LMF03] A. V. Lukashin, M.E.Lakashev, and R. Fuchs. Topology of gene
expression networks as revealed by data mining and modeling.
Bioinformatics, 19(15):1909–1916, 2003.
[LMS96] M. Y. Leung, G. M. Marsh, and T. P. Speed. Over and un-
derrepresentation of short DNA words in herpesvirus genomes.
Journal of Computational Biology, 3:345–360, 1996.
[LPW05] Gad Landau, Laxmi Parida, and Oren Weimann. Using PQ
trees for comparative genomics. In Proc. of the Symp. on Comp.
Pattern Matching, volume 3537 of Lecture Notes in Computer
Science, pages 128–143. Springer-Verlag, 2005.
[LR90] C. E. Lawrence and A. A. Reilly. An expectation maximiza-
tion (EM) algorithm for the identification and characterization
of common sites in unaligned biopolymer sequences. Proteins:
Structure, Function and Genetics, 7:41–51, 1990.
[LR96] J. G. Lawrence and J. R. Roth. Selfish operons: Horizontal
transfer may drive the evolution of gene clusters. Genetics,
143:1843–1860, 1996.
[LW03] H. Li and W. Wang. Dissecting the transcription networks of a
cell using computational genomics. Current Opinion in Genetics
and Development, 13:611–616, 2003.
[MBV05] Aurlien Mazurie, Samuel Bottani, and Massimo Vergassola. An
evolutionary and functional assessment of regulatory network
motifs. Genome Biology, 6(4), 2005.
[MCM81] M. Chein, M. Habib, and M. C. Maurer. Partitive hypergraphs.
Discrete Mathematics, 37:35–50, 1981.
[Mic80] R.S. Michalski. Knowledge acquisition through conceptual clus-
tering: A theoretical framework and algorithm for partitioning
data into conjunctive concepts. International Journal of Policy
Analysis and Information Systems, Vol. 4:219–243, 1980.
[MM00] H. Mannila and C. Meek. Global Partial Orders from Sequential
Data. In Proc. KDD’00, pages 161–168, 2000.
[MO04] S.C. Madeira and A.L. Oliveira. Biclustering algorithms for
biological data analysis: A survey. IEEE/ACM TCBB, 1:24–
45, 2004.
[MPN+ 99] E. M. Marcott, M. Pellegrini, H. L. Ng, D. W. Rice, T. O.
Yeates, and D. Eisenberg. Detecting protein function and
protein-protein interactions. Science, 285:751–753, 1999.

[MSOI+ 02] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii,


and U. Alon. Network motifs: Simple building blocks of complex
networks. Science, 298:824–827, 2002.
[Mur03] T. Murata. Graph mining approaches for the discovery of web
communities. Proceedings of the International Workshop on
Mining Graphs, Trees and Sequences, pages 79–82, 2003.
[OFD+ 99] R. Overbeek, M. Fonstein, M. Dsouza, G. D. Pusch, and
N. Maltsev. The use of gene clusters to infer functional cou-
pling. Proc. Natl. Acad. Sci. USA, 96(6):2896–2901, 1999.
[OFG00] H. Ogata, W. Fujibuchi, and S. Goto. A heuristic graph compar-
ison algorithm and its application to detect functionally related
enzyme clusters. Nucleic Acids Res, 28:4021–4028, 2000.
[Par66] R. J. Parikh. On context-free languages. J. Assoc. Comp.
Mach., 13:570–581, 1966.
[Par98] L. Parida. A uniform framework for ordered restriction map
problems. Journal of Computational Biology, 5(4):725–739,
1998.
[Par99] L. Parida. On the approximability of physical map problems
using single molecule methods. Procceedings of Discrete Mathe-
matics and Theoretical Computer Science (DMTCS 99), pages
310–328, 1999.
[Par06] L. Parida. A PQ framework for reconstructions of common an-
cestors & phylogeny. In RECOMB Satellite Workshop on Com-
parative Genomics, LNBI, volume 4205, pages 141–155, 2006.
[Par07a] Laxmi Parida. Discovering topological motifs using a compact
notation. Journal of Computational Biology, 14(3):46–69, 2007.
[Par07b] Laxmi Parida. Gapped permutation pattern discovery for gene
order comparisons. 14(1):46–56, 2007.
[PBCB99] A.G. Pederson, P. Baldi, Y. Chauvin, and S. Brunak. The
biology of eukaryotic promoter prediction: A review. Computers
and Chemistry, 23:191–207, 1999.
[PCGS05] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot. Bases
of motifs for generating repeated patterns with wild cards.
IEEE/ACM Transaction on Computational Biology and Bioin-
formatics, 2(1):40–50, 2005.
[PG99] L. Parida and D. Geiger. Mass estimation of DNA molecules
& extraction of ordered restriction maps in optical mapping
imagery. Algorithmica, (2/3):295–310, 1999.

[PR05] L. Parida and N. Ramakrishnan. Redescription Mining: Struc-


ture Theory and Algorithms. In Proc. AAAI’05, pages 837–844,
July 2005.
[PRF+ 00] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, and Y. Gao.
Pattern discovery on character sets and real-valued data: Lin-
ear bound on irredundant motifs and an efficient polynomial
time algorithm. In Proceedings of the eleventh ACM-SIAM Sym-
posium on Discrete Algorithms (SODA), pages 297–308. ACM
Press, 2000.
[PRP03] Alkes Price, Sriram Ramabhadran, and Pavel Pevzner. Finding
subtle motifs by branching from sample strings. In Bioinfor-
matics, number 1, pages 149–155, 2003.
[Prü18] H. Prüfer. Neuer beweis eines satzes über permutationen. Arch.
Math. Phys, 27:742–744, 1918.
[PS00] P. A. Pevzner and S.-H. Sze. Combinatorial approaches to
finding subtle signals in DNA sequences. In Proceedings of
the Eighth International Conference on Intelligent Systems for
Molecular Biology, pages 269–278. AAAI Press, 2000.
[PSC01] Y. Pilpel, P. Sudarsanam, and G.M. Church. Identifying regula-
tory networks by combinatorial analysis of promoter elements.
Nature Genetics, 29:153–159, Oct 2001.
[PZ05] L. Parida and R. Zhou. Combinatorial pattern discovery ap-
proach for the folding trajectory analysis of a β-hairpin. PLoS
Computational Biology, 1(1), 2005.
[RBH05] S. Rajasekaran, S. Balla, and C.-H Huang. Exact algorithms
for planted motif problems. Journal of Computational Biology,
12(8):1117–1128, 2005.
[RKM+ 04] N. Ramakrishnan, D. Kumar, B. Mishra, M. Potts, and R.F.
Helm. Turning CARTwheels: An Alternating Algorithm for
Mining Redescriptions. In Proc. KDD’04, pages 266–275, Aug
2004.
[RP07] Naren Ramakrishnan and Laxmi Parida. Modeling the combi-
natorial control of transcription using partial order motifs and
their redescriptions. Preprint, 2007.
[RSW04] A. Reményi, H.R. Schöler, and M. Wilmanns. Combinatorial
control of gene expression. Nature Structural and Molecular
Biology, 11(4):812–815, Sep 2004.
[Sag98] M. F. Sagot. Spelling approximate repeated or common motifs
using a suffix tree. Latin 98: Theoretical Informatics, Lecture
Notes in Computer Science, 1380:111–127, 1998.

[SBH+ 01] I. Simon, J. Barnett, N. Hannett, C. T. Harbison, N. J. Rinaldi,


T. L. Volkert, J. J. Wyrick, J. Zeitlinger, D. K. Gifford, T. S.
Jaakkola, and R. A. Young. Serial regulation of transcriptional
regulators in the yeast cell cycle. Cell, 106:697–708, 2001.

[SCH+ 97] A. Samad, W. W. Cai, X. Hu, B. Irvin, J. Jing, J. Reed,


X. Meng, J. Huang, E. Huff, B. Porter, A. Shenker, T. Anan-
tharaman, B. Mishra, V. Clarke, E. Dimalanta, J. Edington,
C. Hiort, R. Rabbah, J. Skiadas, and D. Schwartz. Mapping
the genome one molecule at a time – optical mapping. Nature,
378:516–517, 1997.

[SLBH00] B. Snel, G Lehmann, P Bork, and M A Huynen. A web-server


to retrieve and display repeatedly occurring neighbourhood of
a gene. Nucleic Acids Research, 28(18):3443–3444, 2000.

[SMA+ 97] J. L. Siefert, K. A. Martin, F. Abdi, W. R. Widger, and G. E.


Fox. Conserved gene clusters in bacterial genomes provide fur-
ther support for the primacy of RNA. J. Mol. Evol., 45:467–472,
1997.

[SOMMA02] S.S. Shen-Orr, R. Milo, S. Mangan, and U. Alon. Network


motifs in the transcriptional regulation network of Escherichia
coli. Nature Genetics, 31:64–68, 2002.

[SS04] T. Schmidt and J. Stoye. Quadratic time algorithms for finding


common intervals in two and more sequences. CPM, LNCS
3109:347–358, 2004.

[SS05] E. Segal and R. Sharan. A discriminative model for identifying


spatial cis-regulatory modules. JCB, 12(6):822–834, 2005.

[SSDZ05] A.D. Smith, P. Sumazin, D. Das, and M.Q. Zhang. Mining chip-
chip data for transcription factor and cofactor binding sites.
Bioinformatics, 21 (Suppl1):403–412, 2005.

[SSR+ 03] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller,


and N. Friedman. Module networks: Identifying regulatory
modules and their condition-specific regulators from gene ex-
pression data. Nature Genetics, 34(2):166–176, June 2003.

[Sto88] G. D. Stormo. Computer methods for analyzing sequence recog-


nition of nucleic acids. Annual Review of Biophysics and Bio-
physical Chemistry, 17:241–263, 1988.

[TCOV97] J. Tamames, G. Casari, C. Ouzounis, and A. Valencia. Con-


served clusters of functionally related genes in two bacterial
genomes. J. Mol. Evol., 44:66–73, 1997.

[TK98] K. Tomii and M. Kanehisa. A comparative analysis of ABC


transporters in complete microbial genomes. Genome Res,
8:1048–1059, 1998.
[TKT+ 94] S. Taguchi, S. Kojima, M. Terabe, K. I. Miura, and H. Momose.
Comparative studies on primary structures and inhibitory prop-
erties of subtilisintrypsin inhibitors from streptomyces. Eur J.
Biochem., 220:911–918, 1994.
[TLB+ 05] Martin Tompa, Nan Li, Timothy L. Bailey, George M. Church,
Bart De Moor, Eleazar Eskin, Alexander V. Favorov, Mar-
tin C. Frith, Yutao Fu, W. James Kent, Vsevolod J. Makeev,
Andrei A. Mironov, William Stafford Noble, Giulio Pavesi,
Graziano Pesole, Mireille Régnier, Nicolas Simonis, Saurabh
Sinha, Gert Thijs, Jacques van Helden, Mathias Vandenbogaert,
Zhiping Weng, Christopher Workman, Chun Ye, and Zhou Zhu.
Assessing computational tools for the discovery of transcription
factor binding sites. Nature Biotechnology, 23:137–144, 2005.
[UY00] T. Uno and M. Yagiura. Fast algorithms to enumerate all com-
mon intervals of two permutations. Algorithmica, 26(2):290–
309, 2000.
[VCP+ 95] A. Volbeda, M. H. Charon, C. Piras, E. C. Hatchikian, M. Frey,
and J. C. Fontecilla-Camps. Crystal structure of the nickel-
iron hydrogenase from desulfovibrio gigas. Nature, 373:580–587,
1995.
[Vit67] Andrew J. Viterbi. Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm. In IEEE Trans-
actions on Information Theory, volume 13(2), pages 260–269,
1967.

[VPPP00] R.E. Valdes-Perez, V. Pericliev, and F. Pereira. Concise, intel-


ligible, and approximate profiling of multiple classes. Interna-
tional Journal of Human-Computer Studies, Vol. 53(3):411–436,
2000.
[WAG84] M.S. Waterman, R. Aratia, and D.J. Galas. Pattern recogni-
tion in several sequences: Consensus and alignment. Bulletin of
Mathematical Biology, 46(4):515–527, 1984.
[Wat95] M.S. Waterman. An Introduction to Computational Biology:
Maps, Sequences and Genomes. Chapman Hall, 1995.
[WMIG97] H. Watanabe, H. Mori, T. Itoh, and T. Gojobori. Genome
plasticity as a paradigm of eubacteria evolution. J. Mol. Evol.,
44:S57–S64, 1997.

[WOB03] S. Wuchty, Z.N. Oltvai, and A-L Barabasi. Evolutionary con-


servation of motif constituents in the yeast protein interaction
networks. Nature Genetics, 35(2):176–179, 2003.
[Zak04] M.J. Zaki. Mining non-redundant association rules. DMKD,
9(3):223–248, 2004.
[ZKW+ 05] Lan V Zhang, Oliver D King, Sharyl L Wong, Debra S Goldberg,
Amy HY Tong, Guillaume Lesage, Brenda Andrews, Howard
Bussey, Charles Boone, and Frederick P Roth. Motifs, themes
and thematic maps of an integrated saccharomyces cerevisiae
interaction network. Journal of Biology, Vol. 4(2), 2005.
[ZPKM07] Ruhong Zhou, Laxmi Parida, Kush Kapila, and Sudhir Mudur.
PROTERAN: Animated terrain evolution for visual analysis of
patterns in protein folding trajectory. Bioinformatics, 23(1):99–
106, 2007.
[ZZ99] J. Zhu and M.Q. Zhang. SCPD: A promoter database of the
yeast saccharomyces cerevisiae. Bioinformatics, 15:607–611,
1999.
Index

P -arrangement, 330 minimal set intersection, 448,


counts, 331 463
lemma, 332, 333 Minimum Mutation Labeling,
lower bound, 341 26
theorem, 335 Minimum Spanning Tree, 19
upper bound, 336 monotone expression extraction,
ΦX , 77 477
χ-square, 200 ordered enumeration trie of multi-
πpattern, 266 sets, 451, 464
σ-algebra, 49 partial order, 500
LU , 373 partial order construction, 482
LcU , 373 pattern discovery, 183
Lp , 267 density constraint, 207
Lp , 90, 141, 142 PQ tree construction, 301
Prim’s, 18
algorithm
Projections, 255
approximation, 41
RC intervals extraction, 282
bicluster extraction, 479
SP-Star, 261
Compact Motifs Discovery, 399
Constructing trees from Prüfer straddle graph construction, 422
Sequences, 37 strongly connected, 125
decoder, 130 time
DFS traversal, 125 exponential, 21, 40
Enumerating Unrooted Trees, polynomial, 40
36 Viterbi, 130, 132
Fitch’s, 23 Winnower, 252
generalized suffix tree, 205 alignment, 141
incremental discovery, 414 local (genome), 319
intervals extraction naive, 281 alphabet
irreducible interval extraction, discrete, 140, 168
297 reals, 167, 174, 175
Kosaraju, 125 size, 232
LIFO operations, 282 anti-pattern, 108
maximal string intersection, 466 antisymmetric relation, 480
maximality check, 207 asymptotic, 31
MaxMin Path, 44 autocorrelation, 161
microarray pattern extraction, automatic monotone expression dis-
479 covery, 474


axiom variable gap, 164


alternative, 82 wild, 140, 149
Kolmogorov, 48 characteristic function, 77
Axiom of choice, 49 Chebyshev’s Inequality Theorem, 71
clique, 252, 378
B Subtilis-E Coli data, 309 maximal, 382
basis, 96, 160, 178 COG, 309
construction, 178 combinatoric, 345
string pattern, 180 explosion, 371
Bayes’ Theorem, 225 topologies, 357
Bayes’ theorem, 51 common connected component, 319
Bernoulli Scheme, 120 compact
Bernoulli Trial edge, 373, 384
multi-outcome, 55 list LcU , 373
bicluster, 102, 475, 478, 498 location list, 373
constant columns, 475 motif, 373, 384
extraction algorithm, 479 notation, 374
bijection, 358 topological motif, 369
Binomial distribution, 61 trie, 144
biopolymer, 113 vertex, 373, 384
Bonferroni’s Inequalities, 53 compact list
Boole’s Inequality, 53 LcU , 373
boolean closure, 423 characteristic
boolean expression, 471 attribute, 378
conjunctive normal form (CNF), clique, 378
474 discriminant, 378
disjuctive normal form (DNF), expansion, 378
474 flat set, 378
monotone, 474 flat, 410
motif, 102, 104, 471 operation
=c , 380
C-values, 89 Enrich(·), 381
cell, 184, 192, 203 ∩c , 380
size, 185 ∪c , 380
Central limit theorem, 78 \c , 381
central tendency, 72 maxClk(.), 382
challenge problem, 238 operation (binary)
character difference, 381
‘-’, 164 intersection, 380
‘.’, 140, 149 union, 380
dash, 164 operation (unary)
dont care, 140, 149 enrich-subset, 381
dot, 140, 149 maximal-clique-subset, 382
annotation, 194 relation (binary)
solid, 140 complete intersection, 384

complete subset, 384 continuous


consistent, 384 normal, 64, 199
inconsistent, 384 cumulative, 60
compatible, 480 discrete
complete extensions, 500–502 binomial, 61
compomer, 318 geometric, 209
conditional probability, 50 Poisson, 48, 62, 200
conjugate DNA repeats
compact list, 374 LINE, 114
compact vertex, 376 microsatellites, 114
relation, 377 SINE, 114
conjunctive normal form (CNF), 474 STR, 114
Consensus Motif Problem, 236 tandem, 114
constraint VNTR, 114
combinatorial, 99 Y-STR, 114
density, 156, 162 duality, 90
quorum, 157, 163, 360 dynamic programming, 339
statistical, 99
contained, 418 E Coli-B Subtilis data, 309
copy number, 207, 266, 450 edit
cumulative distribution function, 60 deletion, 239, 241
distance, 238
data indel, 257
E Coli-B Subtilis, 309 insertion, 239, 241
human-rat, 308 mutation, 239, 241
De Morgan’s laws, 496 eigen
Decoder value, 123
optimal subproblem Lemma, 130 vector, 123
problem, 130 emission
decomposition matrix, 128
theorem, 321 vector, 128
tree, 320 empty order, 459
degenerate, 501 enrich, 381
deletion, 239, 241 entropy, 85
density constraint, 156, 162 conditional, 86
depth first traversal, 125, 436 joint, 86
Dirichlet distribution, 228 enumeration, 246, 260
disjoint neighbor, 246
event, 346 submotif, 249
set, 418 equation
disjuctive normal form (DNF), 474 recurrence, 29, 32
distance error
edit, 238 false negative, 257
Hamming, 106, 238 Type I, 257
distribution Type II, 258

estradiol, 359 asymptotic, 30


estrogen, 359 bijective, 38
estrone, 359 characteristic, 77
event, 49 injective, 38
atomic, 49 one-to-one, 38, 358, 400
disjoint, 346 onto, 38
elementary, 49 surjective, 38
independent, 51
multiple, 51 gene expression, 478
mutually exclusive, 51 General Consecutive Arrangement
union, 52 Problem, 270, 426
excess, 313, 483, 501 generalized suffix tree, 236
PQ, 483 Gibbs Sampling, 227
expectation, 58, 245 graph, 9, 355
moments, 58 acyclic, 13
properties chain, 429
inequality, 58 clique, 378
linearity, 59 combinatorics, 402
nonmultiplicative, 59 common connected component,
Expectation Maximization, 222 319
expression connected, 12
boolean, 471 connected component, 12
gene, 478 cycle, 13
partial order, 470 directed, 405
profiling, 478 strongly connected, 124
edge, 10
false positive, 258 forbidden structure, 462
finite state machine, 121 isomorphism, 358, 403
flat list, 410 labeled, 406
Flip Lemma, 476 local structures, 462
forbidden words, 108 mandatory structure, 462
formula meta-graph, 384
Bonferroni’s Inequalities, 53 consistent, 384
Boole’s Inequality, 53 consistent subgraph, 389
Cayley’s, 16, 38 inconsistent subgraph, 389
expected depth of execution tree, maximal connected consis-
45 tent subgraph, 391
Fibonacci, 33 MCCS, 391
multinomial, 325 node, 10
Stirling’s, 33, 43 path, 12
Vandermonde’s, 63 self-isomorphism, 358
frequentist, 90, 109, 392 transitive reduction, 480
frontier, 342, 427 tree, 13
number, 342 internal node, 13, 146
function leaf node, 13, 146

root node, 146 joint probability, 50


vertex, 10 junk DNA, 114

Hamming distance, 106, 238 k-mer, 142


haplotype block, 106 Kolmogorov’s axioms, 48
Hidden Markov Model, 128
HMM, 128
l-mer, 142
homologous, 105, 140, 165
Law of large numbers, 75
human-rat data, 308
Lebesgue measurable, 49
lemma
i.i.d., 115, 191, 196, 198, 328
inclusion-exclusion principle, 53 P -arrangement principle, 332,
333
Indel Consensus Motif Problem, 257
independent and identically distributed, bridge, 17
115, 191, 196, 198, 328 Cayley’s, 38
independent event, 51 decoder optimal subproblem,
induced subgraph, 378 130
edges, 12 descendant, 420
vertices, 12 edge-vertex, 15
information enumeration-trie size, 436
content, 219 flip, 476
mutual, 86 frontier, 428
theory, 85 intersection conjugate, 382
insertion, 239, 241 interval property, 284
intersection closure, 423 mandatory leaf, 13
interval, 279, 330 maximal (permutation pattern),
contained, 299 268
disjoint, 299 monotone functions, 284
irreducible, 295 multi-tree partition, 24
nested, 299, 334, 335 reduced graph, 481
overlap, 299 straddle graph properties, 425
reducible, 295 subset conjugate, 383
Intervals sum of
Extraction Naive Algorithm, 281 binomial variables, 62
Problem, 279 normal variables, 65
inversion, 239, 330 Poisson variables, 62
irreducible interval, 295 random variables, 62
extraction, 297 symmetric, 489
theorem, 306 tree
irreducible matrix, 124 edges, 42
irreducible matrix theorem, 124, 126 leafnodes, 42
irredundant, 93, 109, 160, 165, 168 linear size, 42
string pattern, 180 trie partial-order, 449
isomorphism, 358 two-tree partition, 23
occurrence, 371 unique straddle graph, 422

    unique transitive reduction, 421, 458
    weakest link, 18
LIFO Operations, 282
list
    relation
        contained, 418
        disjoint, 418
        nested, 418
        overlap, 419
        straddle, 418
local multiple alignment, 213
location list, 90, 141, 142
long interspersed nuclear elements, 114
marginal probability, 51
Markov
    chain, 121
    Inequality Theorem, 70
    property, 121
mass spectrometer, 68
matrix
    emission, 128
    incidence, 471
    irreducible, 124
    primitive, 127
    rank-one, 127
    reducible, 124
    scoring, 233
        BLOSUM, 233
        Dayhoff, 233
        PAM, 233
    substitution, 233
        BLOSUM, 233
        Dayhoff, 233
        PAM, 233
    symmetric, 233
Maxima, 417
maximal, 94, 357, 358
    bicluster, 475
    closed itemset, 475
    location list
        compact, 374
    permutation pattern, 268, 312
    string pattern, 180
        alternative, 211
        composition, 151, 165
        extensible, 165
        extension, 165
        length, 144, 151, 165
        rigid, 151
        solid, 144
    subtle motif, 243
    topological motif, 367, 368
        edge-maximal, 368
        location list, 374
        vertex-maximal, 368
maximal multi-set intersection problem, 450, 465
maximal set intersection problem, 434
maximal string intersection algorithm, 466
mean, 72
measurable
    Lebesgue, 49
    space, 49
measure
    χ-square, 200
    information content, 219
    log likelihood, 217
    z-score, 199
median, 72
microarray, 478
    pattern extraction, 479
microsatellites, 114
minimal, 108
Minimal Consensus PQ Tree Problem, 313
minimal set intersection problem, 447
minimal vertex series-parallel, 489
mode, 72
model selection, 108, 219
Monge
    array, 316
    property, 316
monotone, 104, 471, 497
    expression, 474
Monotone Expression Extraction, 477
Monotone functions Lemma, 284
motif, 90
    consensus, 235
    embedded, 238
    network, 356
    planted, 238
    profile, 214
    subtle, 235
    topological, 356, 360
        location list, 360
        maximal, 359
        nonmaximal, 359
motif learning problem, 215
multi-set, 450
    homologous, 165
    vertices, 367
multinomial
    coefficient, 325
    formula, 325
multiple alignment
    global, 213
    local, 213
multiplicity, 266
    compact lists, 377
    permutation pattern, 312
mutation, 239, 241
mutual information, 85
mutually exclusive event, 51
nested, 333, 418
    interval, 334, 335
network motif, 356
neutral model, 69
node
    child, 419
    parent, 419
    sibling, 419
nodes
    imm-compatible, 483
nonmaximal, 94, 103, 104
nontrivial, 105
Normal distribution, 64
notation
    Cex(B), 481
    MP, 54
    O(e), 472
    P=1, 269
    P>1, 271
    Prm(B), 481
    Ω, 49
    Ω(·), 31
    ΦX, 77
    Π, 330
    Π(·), 266
    Π(e), 472
    Ψ, 265
    Ψ, 274, 327
    Θ(·), 31
    Υ, 30
    O(·), 15, 31
    ω, 49
    ω(·), 31
    πpattern, 266
    o(·), 31
    asymptotic, 15
    big-oh, 15, 31
    graph, 355
    omega, 31
    set difference \, 472
    small-oh, 31
    small-omega, 31
    theta, 31
    topological motif
        M(VM, EM), 374
        compact C(VC, EC), 374
null hypothesis, 69, 81
Occam’s razor, 12
occurrence, 90, 167
    extensible pattern, 164
        degenerate, 197
        nondegenerate, 194
    homologous, 195
    isomorphism, 371
    multiple, 164, 178, 232
    pattern on reals, 167
    permutation pattern, 266
    probability, 192
        extensibility, 210
    rigid pattern, 149
    solid pattern, 142
    topological motif, 360
        distinct, 359
operator
    ⊕, 92, 185
    ⊗, 180
    ⊗, 92, 141, 160, 161, 165
    compact list
        =c, 408
        Enrich(·), 381
        ∩c, 408
        ∪c, 408
        ⊂c, 408
        maxClk(·), 382
        equal, 408
        intersection, 408
        subset, 408
        union, 408
    meet, 141
oranges & apples problem, 500
ordered enumeration trie, 435, 463, 464
    of multi-sets, 451, 464
orthologous, 309
    genes, 265
output-sensitive, 437
overlap, 419
p-value, 80
package
    math
        maxima, 417
    pattern
        Varun, 201, 259
    statistics
        R, 417
Pair Invariant Theorem, 481
Parikh vector, 265, 274, 327
partial order, 480
    algorithm, 500
    chain, 429
    DAG, 499
    empty, 459, 485
    excess, 483
    expression, 470
    graph, 419
    incomparable, 460
    inverse, 485
    motif, 102, 103, 480
    nondegenerate, 486
    reduced, 420
    statistics, 485
    straddle graph, 421, 460
    symmetric, 502
    total, 429, 458
    transitive reduction, 420
Partial Order Construction
    Algorithm, 482
    Problem, 481
partitive families, 320
pattern, 90
    consensus, 235
    duality, 90
    extensible, 140, 164, 193, 258
        degenerate, 196
        nondegenerate, 194
        nontrivial, 179
        realization, 164, 193
        trivial, 179
    forbidden, 108
    haplogroup, 107
    nontrivial, 141
    profile, 214
    recombination, 106
    rigid, 139, 149, 191, 258
        saturation, 151
        size, 140
    solid, 139, 142
    subtle, 235
    topological, 356
    unique, 108
pattern specification
    nontrivial, 99
    well-defined, 91
patternist, 389
performance score, 245
permutation
    interval, 279, 330
    pattern, 101, 266
        structured, 329
Permutation Pattern Discovery Problem, 273
Planted (l, d)-motif Problem, 237
Poisson distribution, 48, 62
posterior, 51
PQ tree, 269, 329, 342, 426
    equivalent, 342
    frontier, 342, 427
PQ Tree Construction Algorithm, 301
Prüfer sequence, 36
principle
    Occam’s razor, 12, 14
    parsimony, 12
prior, 51
probabilistic model
    Bernoulli Scheme, 120
    Bernoulli trial, 209
    Hidden Markov, 128
    HMM, 128
    i.i.d., 115, 191, 196, 198, 328
    Markov, 192, 196
    Markov chain, 121
probability
    axioms, 54, 325
    conditional, 50
    distribution, 50
    joint, 50
    marginal, 51
    mass function conditions, 54, 325
    matrix, 215
    measure, 49
    posterior, 51
    prior, 51
    space, 48, 324
probe, 478
problems
    P-arrangement counts, 331
    automated topological discovery, 399
    class
        #P-complete, 489
        NP-complete, 40, 400
        NP-hard, 21
        tractable, 40
    Connected Graph with Minimum Weight, 10
    consensus motif, 236
    Counting Binary Trees, 34
    decoding, 130
    element retrieval, 29
    Enumerating unrooted trees, 36
    general consecutive arrangement, 270, 426
    graph connectedness, 410
    Indel consensus motif, 257
    intervals, 279
    Largest Common Subgraph, 400
    maximal clique, 400
    maximal multi-set intersection, 450, 465
    maximal set intersection, 434
    maximum clique, 400
    maximum independent set, 410
    Maximum Subgraph Matching, 400
    MaxMin Path, 44
    minimal consensus PQ tree, 313
    minimal set intersection, 447
    Minimum Mutation Labeling, 22
    Minimum Spanning Tree, 14, 16
    monotone expression discovery, 474
    motif learning, 215
    oranges & apples, 500
    output-sensitive, 437
    partial order construction, 481
    permutation pattern discovery, 273
    permutations with exact multiplicities, 273, 328
    permutations with inexact multiplicities, 328
    planted (l, d)-motif, 237
    Steiner Tree, 21
    Subgraph Isomorphism, 400
pseudo
    -random generator, 134
    count, 227
quorum constraint, 163
    bicluster, 479
    boolean expression motif, 474
    permutation pattern, 267
    string pattern, 157
    topological motif, 360
R, 417
Ramsey theory, 89
random
    permutation, 134, 346
    string, 135
    variable, 48, 57
        binomial, 62
        exponential, 209
        Poisson, 64
        product, 58
        sum, 58
rat-human data, 308
RC Intervals Extraction Algorithm, 282
real sequences, 167
realization, 164
recombination pattern, 106
redescription, 104, 493
reduced partial order, 420
    size, 458
Reduced-graph Lemma, 481
reducible
    interval, 295
    matrix, 124
redundant, 93, 109, 160, 165, 176
    string pattern, 180
relation
    =δ, 91
    =r, 167
    ≺, 185
    , 175
    , 149, 166–168
    antisymmetric, 480
    binary, 420
    conjugate, 377
    equivalent, 342
    transitive, 480
representation theorems of analysis, 57
restriction
    enzyme, 208
    site, 208, 209
reversal, 239
    string pattern, 180
Sample Mean & Variance Theorem, 74
saturation, 151, 185, 187, 188, 258
score
    performance, 245
    solution coverage, 246
scoring matrix
    BLOSUM, 233
    Dayhoff, 233
    PAM, 233
sequence
    motif, 104
    pattern, 102
set
    boolean closure, 423, 460
    contained, 299
    disjoint, 299
    intersection closure, 423
    nested, 299, 334, 335
    overlap, 299
    partial order graph, 419
    partitive, 320
    relation
        contained, 418
        disjoint, 418
        nested, 333, 418
        overlap, 419
        straddle, 418, 460
    union closure, 424
short interspersed nuclear elements, 114
short tandem repeat, 114
Single Nucleotide Polymorphism, 106, 107
SNP, 106, 107
specification
    nontrivial, 475
standard deviation, 72
standard error, 73
Standardization theorem, 73
stationary process, 120
statistical significance, 80, 216
statistics
    descriptive, 72
    inferential, 72
    summary, 72
stochastic
    doubly, 122, 233
    matrix, 122, 233
    process, 120
    vector, 122
straddle graph, 421
string, 140
string pattern, 101
strongly connected
    algorithm, 125
    graph, 124
substitution matrix
    BLOSUM, 233
    Dayhoff, 233
    PAM, 233
suffix tree, 144, 205
    generalized, 236
support, 141
Symmetric Lemma, 489
tandem repeats, 114
telescope, 34, 335, 336
    interval size, 336
theorem
    P-arrangement, 335
    basis, 161
    Bayes’, 51, 225
    central limit, 78
    Chebyshev’s inequality, 71
    compact motif, 390
    conjugate maximality, 376
    decomposition, 321
    density-constrained basis, 162
    Generalized Perron-Frobenius, 124
    inclusion-exclusion principle, 53
    irreducible intervals, 306
    irreducible matrix, 124, 126
    law of large numbers, 75
    Markov’s inequality, 70
    maximal begets maximal, 390
    maximal solid pattern, 149
    nonunique basis, 103
    pair invariant, 481
    Prüfer’s, 38
    quorum-constrained basis, 163
    sample mean & variance, 74
    set linear arrangement, 429, 462
    standardization, 73
    suffix tree, 147
    unique basis, 95
    unique maximal (permutation pattern), 269
topological motif, 102, 356
    vertex set
        indistinguishable, 372
        maximally indistinguishable, 373
total order, 458
transition probability, 121, 192
transitive, 95, 480
    reduction, 480, 498
        partial order, 420
tree
    2-3, 30
    AVL, 30
    B-, 30
    balanced, 29
    bifurcating, 22
    binary, 22
    decomposition, 320
    PQ, 426
    red-black, 30
    suffix, 144
trend pattern, 103
trie, 144
    ordered enumeration, 435, 463, 464
Trie partial-order Lemma, 449
Turing Test, 113
union
    closure, 424
    of events, 52
unique pattern, 108
Vandermonde’s identity, 63
variable number tandem repeats, 114
Viterbi Algorithm, 130, 132
VNTR, 114
Y-STR, 114
z-score, 198, 199