Journal of Discrete Algorithms 18 (2013) 113–122

Contents lists available at SciVerse ScienceDirect

Journal of Discrete Algorithms


www.elsevier.com/locate/jda

Indexing hypertext ✩
Chris Thachuk
Department of Computer Science, University of British Columbia, Vancouver, Canada

Article info

Article history:
Available online 9 October 2012

Keywords:
Succinct text indexing
Hypertext
Pattern matching

Abstract

Recent advances in nucleic acid sequencing technologies have motivated research into succinct text indexes to represent
reference genomes that support efficient pattern matching queries. Similarly, sequencing technologies can also produce
reads (patterns) derived from transcripts which need to be aligned to a reference transcriptome. A transcriptome can be
modeled as a hypertext, a generalization of a linear text to a graph where nodes contain text and edges denote which
nodes can be concatenated. Motivated by this application, we propose the first succinct index for hypertext. The index
can model any hypertext and places no restriction on the graph topology. We also propose a new exact pattern matching
algorithm, capable of aligning a pattern to any path in the hypertext, that is especially efficient when few nodes of the
hypertext share common prefixes or when each node has constant degree.
© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Fueling the discovery of genetic variation amongst populations and individuals has been the application of next generation
sequencing technologies (NGS). The new technologies focus on massively parallel sequencing and are capable of
producing millions, and even billions, of reads (patterns) in a typical run [10,9,17]. For an overview of different sequencing
technologies, see the summary given by Myllykangas et al. [17] and references therein. The task of efficiently aligning the
large number of reads produced by the new technologies to a reference genome is one of the most actively researched
problems in contemporary bioinformatics. NGS is also being utilized to capture data from the transcriptome, a process referred
to as RNA-Seq [16]. Instead of sequencing genomic DNA, RNA-Seq aims to sequence the complementary DNA (cDNA) of
RNA molecules in a cell. Transcriptome read alignment is providing valuable information to researchers, beyond genomic
sequencing. In particular, this technology can be used to quantify the level of expression of various transcripts by sequencing
messenger RNA, thus implicating the relative expression level of proteins.
Much more progress has been made in mapping reads from genome data to reference genomes than on aligning reads
derived from transcriptomes. The latter problem is harder by the very nature of the events it is capable of capturing
compared to genomic sequencing. Since introns are spliced from genes in the process of transcription (see Fig. 1), spliced reads
may map to two regions of the genome that are separated by many hundreds or thousands of bases. The difficulty of
aligning NGS reads that span intron boundaries is exacerbated by their short length, and often is not attempted, resulting in a
significant loss of information. When compared to aligning patterns to a reference text, the transcriptome read alignment
problem is modeled more accurately by the problem of aligning patterns to a hypertext.
Informally, hypertext is a generalization of text from a linear structure to a directed graph, G = (V, E), with each node
being a fragment of text and edges implying which fragments of text can be appended; thus, any path in the graph is
a substring of the hypertext. The example transcriptome in Fig. 1 consists of five overall exons between two genes. The
splicing events and valid transcripts are also shown. The resulting hypertext model of this transcriptome has a node for
each exon, and an edge between exons joined by a splicing event, resulting in two components (one for each gene).

Fig. 1. A simple genome, G, is shown having five exons contained in two genes. Exons are strings over the four letter alphabet of DNA. Below is the
corresponding transcriptome, T, which consists of five transcripts. Transcripts are formed by the concatenation of certain exons from G. Above is the
splicing graph, S, where each of the five nodes corresponds to one of the five exons from G, and each directed edge denotes splicing events (concatenation
of exons) that are found in T.

✩ A preliminary version of this work appeared in the 18th International Symposium on String Processing and Information Retrieval (SPIRE 2011).
E-mail address: [email protected].

1570-8667/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.jda.2012.10.001

The seminal work on pattern matching in hypertext is due to Manber and Wu [15], who proposed an O(|V| + m|E| +
occ log log m) time algorithm, where m is the length of the pattern and occ is the number of matches. Akutsu [1] proposed
an O(n) algorithm for matching in hypertext forming a tree structure, where n is the total length of text in all nodes. Park
and Kim [21] considered the case where the hypertext forms a directed acyclic graph by proposing an O(n + m|E|) time
algorithm, under the assumption that no node in G matches to more than one position in the pattern. Amir et al. [2]
proposed an algorithm with the same runtime complexity; however, theirs was the first algorithm for the case of hypertext
forming a general graph. Amir et al. [2] and Navarro [18] also considered the problem of approximate matching in hypertext.
Surprisingly, no (succinct) index for hypertext has been previously proposed. In this work, we propose a succinct index to
model hypertext. Our index can model any hypertext forming a general graph and makes no restriction to the topology. We
also propose a new pattern matching algorithm, capable of aligning a pattern to any path in the hypertext, that is especially
efficient for hypertexts where few nodes share common prefixes or when all nodes are of constant degree. In particular, our
new algorithm can report all matches crossing at most one edge in
\(O\left(m \log \sigma + m \frac{\log |V|}{\log\log |V|} + occ_1 \log n + occ_2 \frac{\log |V|}{\log\log |V|}\right)\) time,
where \(occ_1\) (\(occ_2\)) is the number of matches that cross no (one) edge. We also consider a restricted version of the problem,
where only certain paths in the hypertext are considered valid, and also prove the worst case query time complexity is
improved for other restrictions including graph topology. A main contribution of our paper is to show the correspondence
between the hypertext matching problem and the problem of matching text containing wildcards. As we will show, the
former can be viewed as a generalization of the latter. In particular, recent strategies for indexing text with wildcards are
applicable for indexing hypertext. Improvement to one problem may immediately lead to improvements of the other.
While our results in this work are general and relevant to applications that are appropriately modeled by a hypertext,
our original motivation was to better model the transcriptome read alignment problem. We view the results in this work
as a theoretical contribution towards that end. However, the reads produced by current sequencing technologies contain
sequencing errors (errors introduced during the sequencing process) in addition to the genetic variation expected between the
experimental sequence and a reference sequence. A significant challenge that must be overcome, before these approaches
could yield practical tools for transcriptome read alignment, is to efficiently support approximate pattern matching queries.

2. Preliminaries

For a string T of length n over an alphabet Σ, let T[i] denote the ith character of T, and let T[i...j] denote the substring
from the ith to the jth character of T, for i ≤ j. The ith suffix of T is the substring T[i...n]. A suffix array [14] of T$, the
string T followed by a special sentinel character $, is a permutation of the integers [1, ..., n + 1] giving the lexicographic
order of all suffixes of T$, where $ ∉ Σ and $ < c, ∀c ∈ Σ. Conceptually, the suffix array can be thought of as a matrix of
strings where each row is a different suffix of T$, and the rows are in lexicographic order. See Fig. 2 for an example.
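
As a small illustration of the definition above, the following sketch builds the suffix array of T$ by direct sorting. It is a naive construction for exposition only, not a succinct one; the function name `suffix_array` is hypothetical, and the code assumes that the sentinel "$" compares smaller than every character of the text (true for lowercase ASCII letters).

```python
def suffix_array(text, sentinel="$"):
    """Return the suffix array of text + sentinel using 1-based positions.

    Naive sketch: sorts the suffixes directly, which costs far more than
    the specialised constructions used in practice.
    """
    s = text + sentinel
    # Position i (1-based) denotes the suffix s[i-1:], as in the text.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


# The suffixes of "aca$" in sorted order are "$", "a$", "aca$", "ca$".
print(suffix_array("aca"))
```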

2.1. Compressed text indexes

An FM-index F is a succinct representation of the Burrows-Wheeler transform (BWT) of the string T , denoted as FBWT ,
in addition to some auxiliary data structures. The ith character in the string FBWT corresponds to the character, in T , that
precedes the ith lexicographically smallest suffix of T . See Fig. 2 for an example. The structure of the suffix array of T
can be inferred directly from FBWT by the so-called LF-mapping. Specifically, the jth occurrence of a character c in FBWT
corresponds to the jth lexicographically smallest suffix of T that begins with the character c.

Fig. 2. (Left) An example of the underlying suffix array and BWT string for the forward index F of the text T = φacaφgφgaφcgφct$, representing the
serialization of text in exons e_1, ..., e_5, supposing those five exons consist of the five sequences {aca, g, ga, cg, ct} respectively, from Fig. 1. (Right) The
underlying suffix array and BWT string for the reverse index R of the text T^R = φacaφgφagφgcφtc$.

Ferragina and Manzini [7] showed that a pattern P [1 . . . m] can be matched against T by performing a backward search
in F. The backward search initially finds matches of P [m . . . m] in T , then attempts to extend those into matches of P [m −
1 . . . m] in T , and so on. The search maintains a suffix array range denoting the interval in the sorted suffixes that match
the current pattern as a prefix. If the final range [a, b] is non-empty, then P matches in T exactly b − a + 1 times. An
FM-index can also report the locations of all matches in T , if any. Details of backward search, and compressed suffix arrays
in general, can be found in the review by Navarro and Mäkinen [19]. There are numerous implementations of the FM-index
with various time and space trade-offs. We use a result of Mäkinen and Navarro [13]; however, any compressed text index
based on the Burrows–Wheeler transform is compatible with our proposed approach.
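
The backward search described above can be sketched as follows. This is a minimal, uncompressed illustration: the BWT is held as a plain Python string and rank queries are answered with `str.count`, which stands in for the wavelet tree of Lemma 1 and is much slower. The names `bwt` and `backward_search` are hypothetical, and the returned range is half-open and 0-based rather than the closed range [a, b] used in the text.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform of text + sentinel via naive suffix sorting."""
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The character preceding suffix i; for i == 0 this wraps to the sentinel.
    return "".join(s[i - 1] for i in order)


def backward_search(bwt_str, pattern):
    """Return the half-open suffix array range [lo, hi) of suffixes prefixed by pattern."""
    counts = Counter(bwt_str)
    smaller, total = {}, 0
    for c in sorted(counts):
        smaller[c] = total  # number of characters strictly smaller than c
        total += counts[c]
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in smaller:
            return 0, 0
        lo = smaller[c] + bwt_str[:lo].count(c)
        hi = smaller[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi


lo, hi = backward_search(bwt("abracadabra"), "abra")
print(hi - lo)  # number of occurrences of the pattern
```

The size of the final range, hi - lo, plays the role of b - a + 1 in the text.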

Lemma 1. (See Mäkinen and Navarro [13].) An FM-index F, based on the wavelet tree of \(T^{BWT}\), can be represented in
\(nH_k(T) + o(\log \sigma)\) bits of space, for any \(k \le \alpha \log_\sigma n - 1\) and \(0 < \alpha \le 1\), such that the suffix array range of
every suffix of a string X can be computed in \(O(|X| \log \sigma)\) time, and each match of X in T can be reported in an additional
\(O(\log n)\) time, where T is a text of length n over an alphabet of size σ.

2.2. Full-text dictionaries

A full-text dictionary is designed to index a collection of patterns. It has the same capabilities as an FM-index, and in
addition can report all patterns from the dictionary that are contained in an input text P. Thachuk proposed an implementation
that uses auxiliary structures in addition to any FM-index that indexes a serialization of all patterns in the dictionary [25].
Using the FM-index implementation of Lemma 1 we attain the following result.

Lemma 2. (See Thachuk [25].) A succinct full-text dictionary D of a set of k patterns, having combined length n when serialized in a
text T, over an alphabet of size σ, can be represented in
\(nH_k(T) + o(\log \sigma) + n(2 + o(1)) + k\left(\left\lceil \log\frac{n}{k} \right\rceil + 2 + o(1)\right)\) bits such that the
matching statistics of a string P with respect to D can be determined in \(O(|P| \log \sigma)\) time, all \(occ_1\) patterns contained in P can be
reported in an additional \(O(occ_1)\) time, and all \(occ_2\) positions where P is contained within a pattern can be reported in an additional
\(O(occ_2 \log n)\) time.

The matching statistics of a string P , with respect to T , is a list of tuples (q, [a, b]), one for each suffix of P , where q
denotes the longest match of the suffix anywhere in T , and [a, b] is the suffix array range of the matches [20].
When a query text matches properly within a dictionary pattern, we can make use of a fully indexable dictionary to
determine the pattern id in O (1) time.

Lemma 3. (See Raman et al. [22].) A bit vector B of length n containing d 1 bits can be represented in
\(d \log\frac{n}{d} + O\left(d + n\frac{\log\log n}{\log n}\right)\) bits to support the operations rank1(B, i), giving the number of
1 bits appearing in B[1..i], and select1(B, i), giving the position of the ith 1 in B, in O(1) time.

2.3. Orthogonal range query structures

An orthogonal range query data structure indexes a set of points from a two-dimensional grid so that given as input a
bounding rectangle, all points contained within the bounds can be reported efficiently.

Lemma 4. (See Bose et al. [4].) A set N of points from universe M = [1..k] × [1..k], where k = |N|, can be represented in
\((1 + o(1))\,k \log k\) bits to support orthogonal range counting in \(O\left(\frac{\log k}{\log\log k}\right)\) time, and orthogonal
range reporting in \(O\left((1 + occ)\frac{\log k}{\log\log k}\right)\) time, where occ is the size of the output.

2.4. Succinct graph representation

The succinct graph representation of Farzan and Munro supports a number of graph topology query operations in O (1)
time using the best space achievable [6]. In this application, we only require the use of efficient adjacency queries. Their
result is stated in terms of boolean matrices supporting access queries. In terms of graphs, this is equivalent to determining
adjacency of nodes using an adjacency matrix.

Lemma 5. (See Farzan and Munro [6].) A boolean matrix of size n × n with m ones can be represented in
\((1 + \epsilon) \lg \binom{n^2}{m}\) bits for any constant \(\epsilon > 0\), supporting access (and successor) queries in O(1) time.

2.5. Hypertext

A hypertext generalizes the notion of text to be a directed graph G = ( V , E ) such that each node v ∈ V contains text
over an alphabet Σ and the outgoing edges of v are incident to nodes containing text that can follow v’s. A match of a
pattern P to the hypertext G is a path p = v 1 , . . . , v k through G, and an offset l into the first node v 1 , such that P matches
the concatenation of the text in nodes v 1 , . . . , v k , beginning at position l in v 1 , and ending at some prefix of v k .

Pattern matching in hypertext problem

Instance: A hypertext G and a pattern P .


Question: Which paths in G match P ?
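
To make the problem statement concrete, the sketch below is a brute-force reference matcher that follows the definition of a match directly: it tries every node and offset and then walks outgoing edges. It is not the indexed algorithm of this paper. The function name `match_hypertext`, the 0-based offsets, and the edge set in the usage example are assumptions made for illustration (the node texts are the five exon sequences of Fig. 2); the pattern and all node texts are assumed to be non-empty.

```python
def match_hypertext(texts, edges, P):
    """Return all matches of P as (path, offset) pairs.

    texts[v] is the text of node v, edges is a list of (u, v) pairs,
    path is a list of node ids and offset is 0-based in the first node.
    """
    succ = {v: [] for v in range(len(texts))}
    for u, v in edges:
        succ[u].append(v)

    def extend(path, rest):
        # rest is the non-empty part of P still unmatched after path[-1].
        for t in succ[path[-1]]:
            if texts[t].startswith(rest):
                yield path + [t]          # the match ends at a prefix of t
            elif texts[t] and rest.startswith(texts[t]):
                # t is consumed entirely; keep following its outgoing edges.
                yield from extend(path + [t], rest[len(texts[t]):])

    matches = []
    for v, text in enumerate(texts):
        for offset in range(len(text)):
            tail = text[offset:]
            if tail.startswith(P):
                matches.append(([v], offset))      # match within a node
            elif P.startswith(tail):
                for path in extend([v], P[len(tail):]):
                    matches.append((path, offset))
    return matches


exons = ["aca", "g", "ga", "cg", "ct"]
hypothetical_edges = [(0, 1), (0, 2), (1, 2), (3, 4)]
print(match_hypertext(exons, hypothetical_edges, "agg"))
```

Because every recursive step consumes at least one character of P, the search terminates even when the graph contains cycles.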

Previous algorithms for matching in hypertext focused on reporting only the initial node, and offset within that node,
of paths in G matching P [1,2,21]. For our motivating problem of aligning patterns to a transcriptome, the actual path
is required to be known, and that is our focus in the remainder of the paper. However, our matching algorithm can be
simplified if only the initial node of a match (and the offset within the node) is desired.

3. Construction of the hypertext index

The succinct hypertext index is a collection of three sets of data structures: those indexing the node text, those indexing
the graph topology, and useful auxiliary structures. In our pattern matching algorithms we find it useful to identify nodes
of the graph by two different identifiers: forward id, and reverse id. This is reflected in our descriptions of the data structures
below. The forward id gives the prefix lexicographic rank of the text contained within the node as compared with all other
nodes in V . Similarly, the reverse id gives the rank with respect to the suffix lexicographic rank. We show how these
ids can be determined in Section 3.3. However, for many hypertext applications there is a canonical id associated with
each node, giving an absolute ordering of nodes, that should be used for reporting matches. This is the case, for instance,
when modeling a transcriptome where each node represents an exon that can be identified by its order in a reference
genome. Our description below for indexing a hypertext will focus on minimizing the query time of the general case. The
representation is redundant as both the text and graph topology are represented twice. In Section 5, we show how space can
be reduced by increasing the worst case query time.

3.1. Indexing node text

For a given hypertext G = (V, E) over an alphabet Σ of size σ, we construct a text T = φv_1φv_2φ...φv_|V|$, of length n,
that is a serialization of the combined text of the nodes of V, each prefixed by a character φ, such that $ < φ < c, ∀c ∈ Σ.
We assume the nodes are concatenated in order of their canonical id. We will construct and store a full-text dictionary index
F of T. We also construct and store an FM-index R for T^R, the serialization of the reverse of all node text. We let FBWT
(RBWT) denote the BWT string for F (R). See Fig. 2 for an example. Note that it is not necessary to store T or T^R. The
full-text dictionary provides functionality for determining all nodes contained within a given pattern (dictionary matching
problem), all nodes that contain a given pattern (pattern matching problem), and also which nodes contain a suffix of a
pattern as a prefix. The reverse index provides the functionality to determine which nodes contain a prefix of a pattern as
a suffix. In order to determine the canonical id of a node that properly contains a match of a query pattern we utilize a
fully indexable dictionary (FID) similar to applications for document listing [23]. Specifically, we build D, an FID to represent
a bitvector having the same length as T with 1 bits demarcating the ending of patterns in T. The full-text dictionary can
report the absolute position of a match in T. Determining the number of 1 bits, prior to this position in the FID, will yield
the canonical id of the matched node. If the match is within the node with canonical id 0, then the offset is simply the
position within T. Otherwise, it is the distance to a previous position marked with a 1 bit.

Table 1
Inventory of space usage for succinct index of a general hypertext. Sections 5 and 6 explore the removal of various components of the overall index.

Symbol    Description                                             Space (bits)
F         full-text dictionary of forward text                    \(nH_h(T) + o(\log \sigma) + n(2 + o(1)) + |V|\left(\left\lceil \log\frac{n}{|V|} \right\rceil + 2 + o(1)\right)\)
R         succinct suffix array of reverse text                   \(nH_h(T) + o(\log \sigma)\)
D         FID used to identify canonical id from position in T    \(|V| \log\frac{n}{|V|} + O\left(|V| + n\frac{\log\log n}{\log n}\right)\)
P         2D point structure containing graph topology            \((1 + o(1))\,|E| \log |E|\)
Q         succinct graph representation                           \((1 + \epsilon) \log \binom{n^2}{m}\)
Π_{R→F}   mapping of reverse id to forward id                     \(|V| \lceil \log |V| \rceil\)
Π_{F→C}   mapping of forward id to canonical id                   \(|V| \lceil \log |V| \rceil\)
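
The serialization of Section 3.1 and the lookup of a canonical id from a position in T can be sketched as follows. This is an uncompressed illustration under several assumptions: the characters "%" and "$" stand in for φ and the sentinel (chosen so that "$" < "%" < any lowercase letter), a sorted list of marked positions with binary search stands in for the rank operation of the FID D, positions and offsets are 0-based, node texts are non-empty, and the names `serialize`, `end_marks` and `locate` are hypothetical.

```python
from bisect import bisect_left

PHI, SENTINEL = "%", "$"  # stand-ins for the separator and sentinel characters


def serialize(texts):
    """Concatenate node texts in canonical order, each prefixed by PHI."""
    return "".join(PHI + t for t in texts) + SENTINEL


def end_marks(texts):
    """0-based positions in T of the last character of each node (the 1 bits of D)."""
    marks, pos = [], -1
    for t in texts:
        pos += 1 + len(t)  # step over PHI, then to the last character of the node
        marks.append(pos)
    return marks


def locate(marks, p):
    """Map a text position p in T to (canonical id, 0-based offset in that node)."""
    node = bisect_left(marks, p)  # number of marked positions strictly before p
    start = marks[node - 1] + 2 if node > 0 else 1  # first character of the node
    return node, p - start


exons = ["aca", "g", "ga", "cg", "ct"]
T = serialize(exons)
marks = end_marks(exons)
print(T, locate(marks, T.index("cg")))
```

In this sketch the offset is computed relative to the first character of the node, so the separator character is skipped explicitly rather than counted in the distance to the previous marked position.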

3.2. Storing graph topology

We store the graph topology twice. First, we construct a 2D range query index, P, that is heavily utilized for matching
patterns crossing a single edge. Conceptually, the y-axis corresponds to forward ids and the x-axis corresponds to reverse
ids. A point (a, b) is added to the index if and only if in E there is an edge from the node with reverse id a to the node
with forward id b. Unfortunately, the 2D range query structure does not permit us to determine if two specific nodes are
adjacent in O (1) time; therefore, we construct Q using the succinct graph representation of Farzan & Munro [6] for this
purpose. Specifically, all nodes in Q are referenced by their forward id.

3.3. Auxiliary data structures

Each node can be ranked according to its prefix lexicographic order in the forward index F. For instance, we can determine
the prefix lexicographic order of all |V| nodes by performing backward search on the serialized text T. After the text
‘v_|V|$’ has been matched in F, three facts are known: (i) the matching suffix array range [a, b] will be a size one interval
(i.e., a = b) since $ occurs only in one position of T, (ii) FBWT[a] = φ since each node is prefixed by a φ character, and
(iii) the rank of the φ character at position a in FBWT corresponds to the prefix lexicographic rank of node v_|V|, with respect
to all other nodes in V, due to the properties of the LF-mapping. We can continue to determine the prefix lexicographic
order of all nodes in V in a single traversal of T . Similarly, we can determine the suffix lexicographic order of all nodes
using the reverse index. We find it convenient to store a permutation Π R → F that maps the suffix lexicographic order of a
node in the reverse index to the prefix lexicographic order of that node in the forward index. Furthermore, since we report
all matches with respect to the canonical id label, we also store a permutation Π F →C that maps forward ids to canonical
ids. Our overall space usage for a general hypertext is summarized in Table 1 and Lemma 6. We explore how some of these
data structures can be removed in Sections 5 and 6.
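
The forward ids, reverse ids and the two permutations can be illustrated with plain sorting. The sketch below does not use backward search over the index as described above; it simply sorts the node texts (and their reversals) and is meant only to show what the permutations contain. The function name `node_ids` is hypothetical, ids are 0-based, and breaking ties between identical node texts by canonical id is an assumption of this sketch.

```python
def node_ids(texts):
    """Return (pi_r_to_f, pi_f_to_c) as Python lists with 0-based ids.

    pi_f_to_c[f] is the canonical id of the node with forward id f;
    pi_r_to_f[r] is the forward id of the node with reverse id r.
    """
    n = len(texts)
    # Forward id: rank of the node text in lexicographic order.
    by_prefix = sorted(range(n), key=lambda c: (texts[c], c))
    # Reverse id: rank of the reversed node text in lexicographic order.
    by_suffix = sorted(range(n), key=lambda c: (texts[c][::-1], c))
    forward_of = {c: f for f, c in enumerate(by_prefix)}
    pi_f_to_c = by_prefix
    pi_r_to_f = [forward_of[c] for c in by_suffix]
    return pi_r_to_f, pi_f_to_c


print(node_ids(["aca", "g", "ga", "cg", "ct"]))
```

For the five exon sequences of Fig. 2, the forward order is aca, cg, ct, g, ga and the reverse order (comparing reversed texts) is aca, ga, g, cg, ct.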

Lemma 6. A hypertext G = ( V , E ) can be represented in 2nH k ( T ) + o(log σ ) + n(2 + o(1)) + O (| E | log | E |) bits by the above scheme,
where the text of the nodes in V are over an alphabet of size σ and have a combined length of n − | V |.

4. Pattern matching in the hypertext index

We now demonstrate that extensions of techniques used to solve the problem of matching a pattern against a text
containing wildcards are applicable and effective for matching a pattern in a hypertext. In a solution to the wildcard matching
problem, Lam et al. [12] classified a match of a pattern P to the text T into three cases: P matches a position in T containing
no wildcard group, P matches a position in T containing one wildcard group, and P matches a position in T containing
more than one wildcard group, where a wildcard group is a consecutive sequence of wildcard characters. We will solve the
problem of matching a pattern P in a hypertext G = ( V , E ) by considering three analogous cases: (i) P does not span any
edge from E, (ii) P spans exactly one edge from E, and (iii) P spans more than one edge from E. As we will see, case (i) is
identical, case (ii) is a restriction, and case (iii) is a generalization of the respective wildcard cases.

4.1. Preprocessing the pattern

Considering that a match of P is a path through G, we need an efficient means to determine which nodes contain P as
a substring, which are contained within P , which nodes contain a prefix of P as a suffix, and which contain a suffix of P

as a prefix. Consider for a moment how we may determine which nodes contain the suffix P [i . . . m] as a prefix. Suppose
P [i . . . m] has a non-empty suffix array range [a, b] in the forward index F. If P [i . . . m] is a prefix of some node v i then
two things must be true: (i) [c , d], the suffix array range of v i in F, must be a sub-interval of [a, b], and (ii) FBWT [c . . . d]
must contain a φ character corresponding to v i , since all node texts are prefixed by the φ character in the construction
of F. Therefore, by determining the rank of the first and the last φ characters in FBWT [a . . . b], we will determine a range of
forward ids corresponding to nodes that contain P [i . . . m] as a prefix. Using backward search we can determine the range
of matching forward ids, if any, for each suffix of P in O (m log σ ) time (by Lemma 1). To determine which nodes contain
a prefix of P as a suffix, we can instead determine which nodes, when their text is reversed, contain a suffix of P R , the
reverse of P , as a prefix. Therefore, by performing a backward search of P R in the reverse index R, a range of reverse
ids can be determined in O (m log σ ) time (by Lemma 1). Finally, we also determine the matching statistics of P [1 . . . m]
and P [2 . . . m − 1] with respect to F in O (m log σ ) time (by Lemma 2). The matching statistics for P [1 . . . m] are used to
determine matches of the pattern within nodes, while the matching statistics for P [2 . . . m − 1] are used to determine
matches of nodes contained within P [2 . . . m − 1]. The forward and reverse id ranges for every prefix and suffix of P as well
as the matching statistics for P [1 . . . m] and P [2 . . . m − 1] can be stored in O (m log n) bits.
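
The id ranges computed during preprocessing can be illustrated without an FM-index. In the sketch below, binary search over the sorted node texts stands in for backward search followed by ranking the φ characters in the BWT; the resulting ranges are half-open and 0-based. The names `prefix_range` and `preprocess`, and the use of the largest Unicode code point as an upper sentinel, are assumptions of this sketch.

```python
from bisect import bisect_left, bisect_right

TOP = "\U0010ffff"  # assumed to be larger than any character in the node text


def prefix_range(sorted_texts, s):
    """Half-open range of ranks in sorted_texts whose text starts with s."""
    return bisect_left(sorted_texts, s), bisect_right(sorted_texts, s + TOP)


def preprocess(texts, P):
    """For every split point i (0 < i < len(P)) return two dictionaries:

    suffix_ranges[i]: forward ids of nodes that have P[i:] as a prefix,
    prefix_ranges[i]: reverse ids of nodes that have P[:i] as a suffix.
    """
    fwd = sorted(texts)
    rev = sorted(t[::-1] for t in texts)
    m = len(P)
    suffix_ranges = {i: prefix_range(fwd, P[i:]) for i in range(1, m)}
    # A node ends with P[:i] exactly when its reversal starts with reversed P[:i].
    prefix_ranges = {i: prefix_range(rev, P[:i][::-1]) for i in range(1, m)}
    return suffix_ranges, prefix_ranges


print(preprocess(["aca", "g", "ga", "cg", "ct"], "cag"))
```

An empty range, such as (1, 1), means that no node qualifies for that split point.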

4.2. Matching within a node

If a match of P in G does not span an edge, then P must match as a substring of some node. Let (q, [a, b]) be the
matching statistics of P [1 . . . m] calculated in the preprocessing step. If q = m then there exists at least one match of P as a
substring of a node of G. Furthermore, the suffix array range [a, b] is non-empty and P is contained as a substring of one or
more nodes exactly b − a + 1 times. In each instance, using the full-text dictionary F, we can determine the location of the
match, in T , in O (log n) time. Using the fully-indexable dictionary D, we can determine the canonical id of the matching
node, as well as the offset within that node.

Lemma 7. For a pattern P of length m, the occ number of matches of P within a node of G can be counted in O (m log σ ) time. Their
locations can be reported in an additional O (occ log n) time. The working space is O (m log n) bits.

4.3. Matching across a single edge

If a match of P in G spans a single edge, then there must exist some i, 1 < i ≤ m, such that P[1...i − 1] is a suffix of
some node v_j ∈ V, P[i...m] is a prefix of some node v_k ∈ V, and the edge (j, k) ∈ E. In the preprocessing step, the range
[c , d] of forward ids corresponding to nodes that contain P [i . . . m] as a prefix, as well as the range [a, b] of reverse ids
corresponding to nodes that contain P [1 . . . i − 1] as a suffix, were stored. Similar to other applications in stringology [25,
24,5,11], we make use of the range query data structure to relate the two ranges. If both ranges are non-empty, we can
determine exactly which pairs of ids are connected by a forward edge by reporting all points in P contained in the range
[a, b] × [c , d]. The auxiliary structures Π R → F and Π F →C can be used to report the canonical ids of matches.
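
The single-edge case can be sketched end to end as follows. The edge set is stored as points (reverse id of the source, forward id of the target), and for every split of P the points falling inside the rectangle of matching ids are reported. A linear scan over the points stands in for the orthogonal range structure of Lemma 4, and binary search over sorted texts stands in for the FM-index queries. The function name, the 0-based ids and offsets, and the edge set in the example are assumptions of this sketch.

```python
from bisect import bisect_left, bisect_right

TOP = "\U0010ffff"  # assumed to be larger than any character in the node text


def single_edge_matches(texts, edges, P):
    """Return sorted (u, v, offset) triples: P starts at offset in u and ends in v."""
    n = len(texts)
    by_prefix = sorted(range(n), key=lambda c: (texts[c], c))        # forward id -> node
    by_suffix = sorted(range(n), key=lambda c: (texts[c][::-1], c))  # reverse id -> node
    fwd_id = {c: f for f, c in enumerate(by_prefix)}
    rev_id = {c: r for r, c in enumerate(by_suffix)}
    fwd = [texts[c] for c in by_prefix]
    rev = [texts[c][::-1] for c in by_suffix]
    # One point per edge: x is the reverse id of the source, y the forward id of the target.
    points = {(rev_id[u], fwd_id[v]) for u, v in edges}

    out = []
    for i in range(1, len(P)):
        head, tail = P[:i][::-1], P[i:]
        a, b = bisect_left(rev, head), bisect_right(rev, head + TOP)
        c, d = bisect_left(fwd, tail), bisect_right(fwd, tail + TOP)
        for x, y in points:  # naive stand-in for a range reporting query
            if a <= x < b and c <= y < d:
                u, v = by_suffix[x], by_prefix[y]
                out.append((u, v, len(texts[u]) - i))
    return sorted(out)


exons = ["aca", "g", "ga", "cg", "ct"]
hypothetical_edges = [(0, 1), (0, 2), (1, 2), (3, 4)]
print(single_edge_matches(exons, hypothetical_edges, "cag"))
```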

Lemma 8. For a pattern P of length m, the occ number of matches of P that cross a single edge of G can be counted in
\(O\left(m \log \sigma + m \frac{\log |V|}{\log\log |V|}\right)\) time. Their path descriptions can be reported in an additional
\(O\left(occ \frac{\log |V|}{\log\log |V|}\right)\) time. The working space is O(m log n) bits.

4.4. Matching across multiple edges

If a match of P in G spans more than one edge, then P [2 . . . m − 1] must contain at least one node of G as a substring.
The strategy here, as in previous solutions to the text with wildcard problem [12,24,25], is an extension of the dictionary
matching problem: first identify all γ nodes contained within P [2 . . . m − 1], and second, for each of those γ candidate
matches, determine if it can be extended into a full match of P in G. Consider a candidate match of a node v j , with
forward id j, to the substring P [i . . . i + k − 1], where i > 1 and i + k − 1 < m. This candidate can be extended into a full
match if both the suffix condition—there exists a path leaving v j matching P [i + k . . . m]—and the prefix condition—there exists
a path ending at v j matching P [1 . . . i − 1]—are satisfied.
Recently, Thachuk proposed a dynamic programming algorithm to solve the corresponding text with wildcards version
of the problem [25]. The algorithm works in m stages, by considering successively longer suffixes of P , and in the process
determines if the suffix and prefix conditions of candidate matches are satisfied. We will adapt this algorithm for our
purposes here. However, we note that verifying the suffix (prefix) condition in the hypertext problem is more challenging
as we must consider any path beginning (ending) at a node representing a candidate match. In the text with wildcards
problem, one need only verify the text immediately preceding (succeeding) a candidate match position. This can be viewed
as verifying in a hypertext that forms a single path. For this reason, the algorithm we propose below is a generalization of
the original algorithm.

4.4.1. Overview of the algorithm


Conceptually, the algorithm will consider successively longer suffixes of P [2 . . . m − 1]. Specifically, for each suffix
P [i . . . m − 1], for i = m − 1, m − 2, . . . , 2, the algorithm will consider all γi nodes of G that prefix P [i . . . m − 1] using

the full-text dictionary F. Each of these γi nodes is a candidate requiring the suffix condition to be first verified, and if
successful, the prefix condition is tested. The algorithm will maintain a compact list of all sub-paths of G that are matched
by P [i . . . m]. The head of the list for suffix i will be stored at W[i ], a working space array maintained during the search.
In later stages of the algorithm, we will determine if these sub-paths can be extended to match longer suffixes of P.
Overall, there are γ candidate positions that will be evaluated. Note that the exact count of candidates, γ, can be determined
using F, prior to the first stage of the algorithm in O (m log σ ) time using a counting query. This permits us to allocate
sufficient working space. Our algorithm attempts to track all matching sub-paths in as little working space as possible. We
describe the information tracked during the course of the algorithm, and comment at the end on the overall working space
complexity. In what follows, we describe how the suffix and prefix conditions are verified for a candidate node v j that
matches a k length prefix of P [i . . . m − 1]. This same procedure will be used to verify all γ candidates, across all suffixes of
P [2 . . . m − 1].

4.4.2. Verifying the suffix condition


We must verify that there exists a sub-path in G matching P [i + k . . . m] that begins at some node v t such that ( j , t ) ∈ E.
There are two cases to consider: such a sub-path is a prefix of v t (and thus ends within v t ), or it properly contains v t . We
will refer to the former as a sub-path initiation event, and the latter as a sub-path extension event. For each candidate node v j
we must consider both types.
To verify an initiation event, we first determine the range [a, b] of forward ids corresponding to nodes that contain
P [i + k . . . m] as a prefix. This range was stored in the preprocessing step. Then, we must determine if any of those nodes
have an incoming edge from node v_j. Suppose that node v_j has forward id j and therefore reverse id j′ = Π_{R→F}[j].
A counting query in the range [j′, j′] × [a, b] of P determines cnt_init, the number of matching nodes. Specifically, cnt_init is
the number of sub-paths that originate at node v_j, match P[i...m], and end within a node connected to v_j.
An extension event implies that P [i + k . . . m] must match a sub-path that contains, but does not end within, a node
connected to v j . Therefore, to verify an extension event, we must determine if any putative sub-paths stored in the list at
W[i + k] begin with a node v t such that ( j , t ) ∈ E. (Note that if one or more of these sub-paths do exist, they would have
been stored in W[i + k] at an earlier stage of the algorithm.) For each of the at most γ entries in W[i + k], we use the
canonical id of its initial node, and the canonical id of v j , to perform an adjacency query using Q. Note that since we only
need to establish adjacency between nodes, then no assumption on the graph topology is made and therefore any directed
graph (possibly cyclic) is handled correctly. Let cnt ext be the number of entries in W[i + k] that are connected by an edge
from v j .
If the suffix condition is satisfied we append a new putative sub-path entry to the list at W[i ] and associate with it v j as
the initial node. If cnt ext > 0, then we associate with that entry a list of the cnt ext offsets that denote the entries in W[i + k]
which are connected to v j and form new putative sub-path matches for P [i . . . m]. We also associate with each entry the
count of sub-paths it begins. Consider that each of the cnt ext entries in W[i + k] connected to v j may represent many
sub-paths. The number of sub-paths each represents is stored in the entry. Therefore, the number of sub-paths represented
by our new entry is the sum of the counts for these previous cnt ext entries, plus cnt init .
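
The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the succinct implementation: a plain Python set of edges stands in for both the point data structure and Q, and the names `Entry`, `verify_suffix`, `fwd_id` and `init_range` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    head: int                                  # initial node of the putative sub-path
    count: int                                 # number of sub-paths this entry begins
    back: list = field(default_factory=list)   # offsets into the list at W[i + k]


def verify_suffix(W, edges, fwd_id, j, i, k, init_range):
    """Check the suffix condition for candidate node j matching P[i .. i+k-1].

    W          -- dict: pattern position -> list of Entry (the working array)
    edges      -- set of (source, target) pairs; assumed stand-in for the
                  point data structure and for Q
    fwd_id     -- dict: node -> forward id
    init_range -- (a, b): forward ids of nodes having P[i+k .. m] as a prefix
    """
    a, b = init_range
    # Initiation events: successors of j whose forward id falls in [a, b].
    cnt_init = sum(1 for (u, t) in edges if u == j and a <= fwd_id[t] <= b)
    # Extension events: earlier entries in W[i + k] whose head is a successor of j.
    prev = W.get(i + k, [])
    back = [off for off, e in enumerate(prev) if (j, e.head) in edges]
    count = cnt_init + sum(prev[off].count for off in back)
    if count == 0:
        return None                            # the suffix condition fails
    entry = Entry(head=j, count=count, back=back)
    W.setdefault(i, []).append(entry)
    return entry
```

The linear scan over `edges` is only for readability; in the index the same two counts come from a range counting query and from adjacency queries.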

4.4.3. Verifying the prefix condition


If the suffix condition is satisfied for a candidate node v j , we can test the prefix condition. We need to determine if there
exists one or more nodes that contain P [1 . . . i − 1] as a suffix and have an outgoing edge to node v j . In the preprocessing
step, we stored the range $[a, b]$ of reverse ids corresponding to nodes that contain $P[1 \ldots i-1]$ as a suffix. The number
$cnt_p$ of matching nodes can be found by querying the range $[a, b] \times [j, j]$ in the point data structure P. If $cnt_p > 0$, and the current count of sub-paths
beginning at $v_j$ that match $P[i \ldots m]$ is $cnt_s$, then G contains $cnt_p \times cnt_s$ paths matching P.
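
As a rough sketch of this counting step, the function below multiplies the number of qualifying predecessors by the stored sub-path count. The edge set again stands in for the point data structure, and `count_full_matches`, `rev_id` and `rev_range` are hypothetical names.

```python
def count_full_matches(edges, rev_id, j, rev_range, cnt_s):
    """Return the number of paths matching P that pass through candidate j.

    edges     -- set of (source, target) pairs (assumed stand-in for the
                 point data structure)
    rev_id    -- dict: node -> reverse id
    rev_range -- (a, b): reverse ids of nodes having P[1 .. i-1] as a suffix
    cnt_s     -- number of sub-paths beginning at j that match P[i .. m]
    """
    a, b = rev_range
    # Predecessors of j whose reverse id falls in [a, b] satisfy the prefix condition.
    cnt_p = sum(1 for (u, t) in edges if t == j and a <= rev_id[u] <= b)
    return cnt_p * cnt_s
```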

4.4.4. Reporting all matching paths


Whenever a prefix condition is verified, all matching paths can be enumerated by a simple backtracking procedure using
the information previously stored in W and the new prefix matches. The point data structure P is once again queried to
determine the forward ids of nodes that contain the end of a matching path. These can be translated into canonical ids for
reporting.
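
One way to picture the backtracking is the sketch below. It assumes each entry of W is a dictionary holding its head node, the pattern position `next` where its successors start, its back pointers, and the list `ends` of nodes that close an initiation event. These field names and both function names are illustrative; in the index, `ends` would be recovered by querying the point data structure rather than stored.

```python
def expand(W, pos, off):
    """Yield every node sequence begun by the entry W[pos][off]."""
    entry = W[pos][off]
    # Sub-paths that end inside a successor of the head (initiation events).
    for t in entry["ends"]:
        yield [entry["head"], t]
    # Sub-paths that continue through an earlier entry (extension events).
    for b in entry["back"]:
        for tail in expand(W, entry["next"], b):
            yield [entry["head"]] + tail


def report_paths(W, pos, off, prefix_nodes):
    """Prepend each node satisfying the prefix condition to each sub-path."""
    return [[p] + sub for p in prefix_nodes for sub in expand(W, pos, off)]
```

Because `next` is always a later pattern position than `pos`, the recursion terminates even when the graph contains cycles.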

Lemma 9. For a pattern P of length m, the occ number of matches of P that cross more than one edge in G, can be determined in
$O\left(m \log \sigma + \gamma^2 + \gamma \frac{\log |V|}{\log \log |V|}\right)$ time. Their path descriptions can be reported in an additional $O\left(h + occ \frac{\log |V|}{\log \log |V|}\right)$ time, where h is the
total number of nodes in all occ sub-paths matched by P. The working space is $O(m \log n + \gamma^2 \log \gamma + \gamma \log |V|)$ bits.

Proof. For each candidate node $v_j$, the suffix condition must be verified by checking for both initiation events and for
extension events. When verifying initiation events, a range query on P is performed in $O\left(\frac{\log |V|}{\log \log |V|}\right)$ time and there are at
most $O(\gamma)$ initiation events. When verifying extension events for $v_j$, at most $O(\gamma)$ previous entries representing putative
sub-paths are considered to be extended by $v_j$. For each putative sub-path, an adjacency query is performed in $O(1)$ time
using the graph representation of Q. Thus, verifying extension events takes $O(\gamma)$ time for $v_j$ and $O(\gamma^2)$ time over all candidates. The suffix condition can be
verified for all $\gamma$ candidate nodes in $O\left(\gamma^2 + \gamma \frac{\log |V|}{\log \log |V|}\right)$ time. Verifying the prefix condition is analogous to verifying suffix
initiation events and it takes $O\left(\gamma \frac{\log |V|}{\log \log |V|}\right)$ time to verify the at most $\gamma$ candidates that satisfy the suffix condition. This
yields an overall worst case time $O\left(m \log \sigma + \gamma^2 + \gamma \frac{\log |V|}{\log \log |V|}\right)$.
The working space includes the O (m log n) bits from the preprocessing step, and O (γ log | V |) bits to store counts of
putative sub-paths and initial nodes of those sub-paths. However, the working space is dominated by storing back pointers
(offsets) in the putative sub-path entries of the working array W to each previous entry that forms a valid sub-path match
for a suffix of P. There are at most $O(\gamma)$ entries in W. Thus, each can be uniquely identified with $\lceil \log \gamma \rceil$ bits. In the
worst case, each entry stores $O(\gamma)$ back pointers giving an overall working space of $O(m \log n + \gamma \log |V| + \gamma^2 \log \gamma)$ bits.
Importantly, we note that the total number of matches $\gamma$ can be determined by F in a preprocessing step in $O(m \log \sigma)$
time using a counting query in order to allocate sufficient working space. □

Combining Lemmas 6 through 9 we have our main result. We note that our query time is dependent on γ , the number
of occurrences of nodes as substrings of the query pattern P . The query algorithm is designed under the assumption that,
in practice, γ is expected to be proportional to the length of the pattern, P . As we discuss in Section 6, this can be shown
in the worst case when no node is the prefix of another. However, when many nodes share a common text, and all match
in numerous positions of the query pattern, then the query time becomes dominated by this parameter. The approach
proposed here is inefficient in these cases and is not faster than pattern matching without an index.

Theorem 1. A hypertext G = (V, E) can be represented in $2nH_k(T) + o(\log \sigma) + n(2 + o(1)) + O(|E| \log |E|)$ bits, where the text of the
nodes in V are over an alphabet of size $\sigma$ and have a combined length of $n - |V|$, such that all matches of a pattern P of length m can be
counted in $O\left(m \log \sigma + m \frac{\log |V|}{\log \log |V|} + \gamma^2 + \gamma \frac{\log |V|}{\log \log |V|}\right)$ time, and reported in an additional $O\left(occ_1 \log n + (occ_2 + occ_3) \frac{\log |V|}{\log \log |V|} + h\right)$
time, where $occ_1$ is the number of matches within a node of G, $occ_2$ is the number of matches crossing a single edge, h is the total
number of nodes in all $occ_3$ sub-paths matching P that cross more than one edge, and $\gamma$ is the number of occurrences of a node as a
substring of the pattern P. The working space is $O(m \log n + \gamma^2 \log \gamma + \gamma \log |V|)$ bits.

5. Reducing the index space

The index described in Section 3 is redundant in order to decrease worst case query time for a general hypertext, or
support more general match reporting. If no predetermined canonical id needs to be reported for matching nodes, then
the forward id can be used and the permutation Π F →C can be eliminated. In this case, the topology stored in Q will
be in terms of forward ids as well. The topology of the graph need not be stored twice. The Q representation was used to
enable $O(1)$ time adjacency queries; however, the point data structure P can perform adjacency queries in $O\left(\frac{\log |V|}{\log \log |V|}\right)$ time.
This increases the $\gamma^2$ term in the time complexity of the pattern matching algorithm to $\gamma^2 \frac{\log |V|}{\log \log |V|}$. Finally, in addition to
indexing the node text, the reverse of each node text is also indexed in order to determine which nodes contain a prefix of a
query pattern as a suffix. Very recently, Hon et al. showed that a sparse suffix tree can be used for this purpose to calculate
the appropriate suffix array ranges [8]. This decreases the text index space term from 2nH k ( T ) to nH k ( T ) + O (| V | log n)
bits; however, the time increases from O (m log σ ) to O (m log n) time.

6. Considering restricted hypertext

In this section, we consider a number of interesting yet relevant restrictions to general hypertext. All restrictions apply
to our motivating example of modeling transcriptomes as hypertext.

6.1. Path constraints

While our motivating problem of aligning transcripts to transcriptomes is better modeled using a hypertext index than
a linear text index, the model does not completely capture all necessary information. Specifically, the hypertext models the
splicing graph; however, the set of valid transcripts in the transcriptome is a set $\mathcal{P}$ of paths through the splicing graph. Not
every path in the splicing graph is necessarily a valid transcript. We now show how we can easily extend our index to only
report matches of a pattern P if they are a sub-path of at least one path in $\mathcal{P}$.
For illustrative purposes, assume we have constructed a hypertext index for the example in Fig. 1 and all transcripts
are valid, except for $t_3$. Further suppose the node labels in the hypertext correspond to the exon labels in the figure.
Then the set of valid paths is $\mathcal{P} = \{[e_1, e_2], [e_1, e_3], [e_4], [e_4, e_5]\}$. We construct a serialization of these paths as a string
$S = \phi e_1 e_2 \phi e_1 e_3 \phi e_4 \phi e_4 e_5$, over the integer alphabet $[1 \ldots |V|] \cup \{\phi\}$, where $\phi = 0$. Next, we create an FM-index $\mathcal{S}$ for $S$. We
use the same matching algorithm as before, and for each candidate path $p = p_1, \ldots, p_k$ through G reported, we perform an
additional verification step. Specifically, we check whether the string '$p_1 \ldots p_k$' exists in $S$ as a substring by performing a backward
search query in $\mathcal{S}$. Clearly, if the path description of p is a substring of $S$, then p is a sub-path of some valid path in $\mathcal{P}$.
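
The verification step can be illustrated with a small sketch. It serializes the valid paths with the separator 0 and then tests a candidate by a naive contiguous-sublist scan; the scan is a deliberate simplification that stands in for the FM-index backward search, so it shows the logic but not the stated time bound. The function names are hypothetical, and node ids are assumed to be integers in [1 .. |V|].

```python
def serialize(paths, sep=0):
    """Concatenate paths into one sequence, each preceded by the separator."""
    s = []
    for p in paths:
        s.append(sep)
        s.extend(p)
    return s


def is_valid_subpath(s, p):
    """Return True if the node sequence p occurs contiguously in s."""
    k = len(p)
    return any(s[i:i + k] == p for i in range(len(s) - k + 1))
```

Because the separator is not a node id, a candidate can never match across the boundary between two serialized paths.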

Theorem 2. A set of valid paths $\mathcal{P}$ can be represented in $(1 + o(1)) h \log |V|$ bits, where h is the total number of nodes in all paths in
$\mathcal{P}$, such that a candidate path p, crossing k nodes, can be verified in $O(k \log |V|)$ time.

We note that the space required to store all valid paths (by canonical ids) of a transcriptome is still dominated by and
nearly negligible compared to the space to index the node text.

6.2. Topology constraints

The worst case complexity of the proposed pattern matching algorithm can be improved with a small modification when
the graph topology of the hypertext is known to be sparse; more specifically, if $\Delta(G)$, the maximum degree of any vertex in
V, is $O(1)$. This is the case for our motivating problem of modeling transcriptomes.
For hypertexts with this property, we can adapt the algorithm when considering patterns crossing more than one edge.
When a candidate is under consideration, sub-path extension and sub-path initiation events can be simplified. Since there
are only a constant number of neighbors of any candidate, we can use Q and matching statistics stored during
preprocessing to determine if any neighbor of the candidate satisfies a sub-path initiation event. (The same can be done
for prefix verification.) This simplifies the $\gamma \frac{\log |V|}{\log \log |V|}$ term of the algorithm to $\gamma$.
To improve the time complexity for validating a sub-path extension event, we change how we store matching sub-paths
in the working space W. Specifically, instead of storing a list of heads of sub-paths, for each suffix of P , we keep the matches
in a balanced tree, such as a red–black tree [3], to support querying for a particular node id in O (log γ ) time, as there are
at most γ entries in any bucket. Overall, it will take O (γ log γ ) time to insert all γ entries and O (γ log γ ) time to check
for extension events for all candidates. This simplifies the γ 2 term to γ log γ , yielding the following result.
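
The bucket lookup can be sketched as below. Python has no built-in red–black tree, so this illustration swaps in a sorted list searched with `bisect`: lookups take O(log γ) comparisons as in the text, but insertion into a Python list is linear, so the insertion bound is not reproduced. The class and function names are hypothetical.

```python
import bisect


class Bucket:
    """Heads of putative sub-paths for one suffix of P, kept sorted by node id."""

    def __init__(self):
        self.heads = []    # sorted node ids
        self.counts = []   # counts[i] = number of sub-paths begun by heads[i]

    def insert(self, head, count):
        pos = bisect.bisect_left(self.heads, head)
        self.heads.insert(pos, head)
        self.counts.insert(pos, count)

    def lookup(self, head):
        """Return the sub-path count stored for head, or 0 if it is absent."""
        pos = bisect.bisect_left(self.heads, head)
        if pos < len(self.heads) and self.heads[pos] == head:
            return self.counts[pos]
        return 0


def extension_count(bucket, neighbours):
    """Sum the counts of entries headed by one of the candidate's O(1) neighbours."""
    return sum(bucket.lookup(t) for t in neighbours)
```

With constant degree, each candidate triggers only a constant number of lookups, which is where the γ log γ term comes from.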

Theorem 3. For a hypertext G = (V, E), with node text having total combined length n, if $\Delta(G)$, the maximum degree of any vertex
in V, is $O(1)$, then the pattern matching algorithm can be adapted as described above to improve the counting time complexity to
$O\left(m \log \sigma + m \frac{\log |V|}{\log \log |V|} + \gamma \log \gamma\right)$, and the working space to $O(m \log n + \gamma \log \gamma + \gamma \log |V|)$ bits, where m is the length of the
pattern and $\gamma$ is the number of occurrences of a node as a substring of the pattern.

6.3. Text constraints

A set of nodes in a hypertext form a prefix-free code if no node is the prefix of another. They are said to form a
quasi-prefix-free code if none is a prefix of more than $O(1)$ other nodes.
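
To make the definition concrete, the hypothetical helper below reports, for a list of node texts, the largest number of other nodes that any single node is a prefix of. A result of 0 means the texts form a prefix-free code; a result bounded by a constant corresponds to the quasi-prefix-free case. The quadratic scan is for illustration only.

```python
def max_prefix_fanout(texts):
    """Largest number of other node texts that some node text is a prefix of."""
    return max(
        (
            sum(1 for j, u in enumerate(texts) if j != i and u.startswith(t))
            for i, t in enumerate(texts)
        ),
        default=0,
    )
```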

Lemma 10. If the nodes of a hypertext G = ( V , E ) form a quasi-prefix-free code then for a pattern P of length m, the number of
candidate matches of any suffix of P is O (1) and therefore O (m) overall.

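Proof. Suppose the nodes do form a quasi-prefix-free code but $\gamma = \omega(m)$. Then, by the pigeonhole principle, there must
be at least one suffix, $P[i \ldots m]$, that has $t = \omega(1)$ candidates as a prefix. All of these candidates are prefixes of the same string, so the
shortest of them is a prefix of the other $t - 1 = \omega(1)$ candidates, contradicting the quasi-prefix-free assumption. □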

With this restriction, each suffix of P contains at most O (1) candidates as a prefix, and therefore, the working space
W will track at most O (1) head nodes for each position of P . Therefore, each potential sub-path extension event only
needs to check O (1) candidates in O (1) time using Q to perform adjacency queries. Since the number of overall candidates
γ = O (m), we have the following result.

Theorem 4. For a hypertext G = (V, E), with node text having total combined length n, if the text in the nodes of V form a
quasi-prefix-free code, then the pattern matching algorithm has a counting time complexity of $O\left(m \log \sigma + m \frac{\log |V|}{\log \log |V|}\right)$ and working space
$O(m \log n)$ bits.

7. Conclusions

We proposed a succinct index to model hypertext. The index can model any hypertext and places no restriction on the
graph topology. We proposed a new pattern matching algorithm, capable of aligning a pattern to any path in the hypertext.
We showed how the index can occupy space proportional to the compressed text and graph topology representation. We
also studied a number of interesting restrictions of hypertext, including when the graph consists of nodes with constant
degree, when only certain paths within the hypertext are considered valid, and when few nodes share common prefixes. In
these cases, which are motivated by the problem of aligning patterns to transcriptomes, we gave variants of the algorithm
with improved query time complexity that is dependent only on the pattern length and logarithmically in the size of the
hypertext. Previous algorithms for exact matching were at least linear in the size of the hypertext. We demonstrated the
correspondence between indexing and matching within hypertext to the problem of indexing and matching text containing
wildcards. Future improvements for one problem may immediately improve the other.
When no restrictions are placed on the topology of the hypertext, the time of our query algorithm depends on the
number of nodes of the hypertext that are contained as substrings of the pattern. In instances where the hypertext contains
many identical nodes, querying with our proposed algorithm can actually be less efficient than matching
without an index. Fortunately, the number of matches of nodes within a pattern can be counted efficiently, prior to reporting
matches. However, this leads to the following open question: can an unrestricted hypertext be indexed such that the query
time is dependent only on the pattern length and logarithmically in the size of the hypertext?
Development of an efficient algorithm to report approximate matches in the hypertext index is an important future
direction. This is particularly important for biological sequences, including our motivating problem of modeling
transcriptomes. Incorporating support for approximate matching in the methods proposed here is crucial before they can be practical
for transcriptome read alignment.
Finally, our algorithm makes use of a succinct 2D orthogonal range structure to model the graph topology and perform
orthogonal range queries. To our knowledge, no succinct structure for this purpose also supports point existence queries in
constant time. Such a structure would eliminate the need to store the graph topology twice without increasing the hypertext
query time.

References

[1] T. Akutsu, A linear time pattern matching algorithm between a string and a tree, in: Symposium on Combinatorial Pattern Matching, 1993, pp. 1–10.
[2] A. Amir, M. Lewenstein, N. Lewenstein, Pattern matching in hypertext, Journal of Algorithms 35 (1) (2000) 82–99.
[3] R. Bayer, Symmetric binary B-Trees: Data structure and maintenance algorithms, Acta Informatica 1 (1972) 290–306.
[4] P. Bose, M. He, A. Maheshwari, P. Morin, Succinct orthogonal range search structures on a grid with applications to text indexing, in: Proceedings of
the 11th International Symposium on Algorithms and Data Structures, WADS’09, 2009, pp. 98–109.
[5] F. Claude, G. Navarro, Self-indexed text compression using straight-line programs, in: Mathematical Foundations of Computer Science, 2009, pp. 235–246.
[6] A. Farzan, J. Munro, Succinct representations of arbitrary graphs, in: 16th Annual European Symposium on Algorithms, 2008, pp. 393–404.
[7] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Symposium on Foundations of Computer Science, 2002, pp. 390–398.
[8] W.-K. Hon, T.-H. Ku, R. Shah, S.V. Thankachan, J.S. Vitter, Compressed text indexing with wildcards, in: Proceedings of the 18th International Conference
on String Processing and Information Retrieval, SPIRE’11, 2011, pp. 267–277.
[9] D. Horner, G. Pavesi, T. Castrignano, P. De Meo, S. Liuni, M. Sammeth, E. Picardi, G. Pesole, Bioinformatics approaches for genomics and post genomics
applications of next-generation sequencing, Briefings in Bioinformatics 11 (2) (2010) 181–197.
[10] S. Jay, H. Ji, Next-generation DNA sequencing, Nature Biotechnology 26 (10) (2008) 1135–1145.
[11] J. Kärkkäinen, Repetition-based text indexes, Ph.D. thesis, University of Helsinki, 1999.
[12] T.-W. Lam, W.-K. Sung, S.-L. Tam, S.-M. Yiu, Space efficient indexes for string matching with don’t cares, in: Conference on Algorithms and Computation,
2007, pp. 846–857.
[13] V. Mäkinen, G. Navarro, Implicit compression boosting with applications to self-indexing, in: Conference on String Processing and Information Retrieval,
2007, pp. 229–241.
[14] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, in: ACM–SIAM Symposium on Discrete Algorithms, 1990, pp. 319–327.
[15] U. Manber, S. Wu, Approximate string matching with arbitrary costs for text and hypertext, in: IAPR International Workshop on Structural and Syntactic
Pattern Recognition, 1992, pp. 22–33.
[16] A. Mortazavi, B. Williams, K. McCue, L. Schaeffer, B. Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nature Methods 5 (7)
(2008) 621–628.
[17] S. Myllykangas, J. Buenrostro, H.P. Ji, Overview of sequencing technology platforms, in: N. Rodríguez-Ezpeleta, M. Hackenberg, A.M. Aransay (Eds.),
Bioinformatics for High Throughput Sequencing, Springer, New York, 2012, pp. 11–25.
[18] G. Navarro, Improved approximate pattern matching on hypertext, Theoretical Computer Science 237 (1–2) (2000) 455–463.
[19] G. Navarro, V. Mäkinen, Compressed full-text indexes, ACM Computing Surveys 39 (1) (2007) 2.
[20] E. Ohlebusch, S. Gog, A. Kügel, Computing matching statistics and maximal exact matches on compressed full-text indexes, in: Symposium on String
Processing and Information Retrieval, 2010, pp. 347–358.
[21] K. Park, D. Kim, String matching in hypertext, in: Symposium on Combinatorial Pattern Matching, 1995, p. 318.
[22] R. Raman, V. Raman, S. Rao, Succinct indexable dictionaries with applications to encoding k-ary trees and multisets, in: ACM–SIAM Symposium on
Discrete Algorithms, 2002, pp. 233–242.
[23] K. Sadakane, Succinct data structures for flexible text retrieval systems, Journal of Discrete Algorithms 5 (1) (2007) 12–22.
[24] A. Tam, E. Wu, T.-W. Lam, S.-M. Yiu, Succinct text indexing with wildcards, in: Symposium on String Processing and Information Retrieval, 2009,
pp. 39–50.
[25] C. Thachuk, Succincter text indexing with wildcards, in: Symposium on Combinatorial Pattern Matching, 2011, pp. 27–40.
