Space-Efficient and Exact de Bruijn Graph Representation Based On A Bloom Filter
Abstract. The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers,
rely on an in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require
a large amount of memory (≥ 30 GB).
We propose a new encoding of the de Bruijn graph, which occupies an
order of magnitude less space than current representations. The encoding
is based on a Bloom filter, with an additional structure to remove critical
false positives. An assembly software implementing this structure, Minia,
performed a complete de novo assembly of human genome short reads
using 5.7 GB of memory in 23 hours.
Introduction
$1.44 \log_2\left(\frac{16k}{2.08}\right) + 2.08$ bits/k-mer.
For the human genome example above and k = 27, the size of the structure is
3.7 GB, i.e. 13.2 bits per node. This is effectively below the self-information of
the nodes. While this may appear surprising, this structure does not store the
precise set of nodes in memory. In fact, compared to a classical de Bruijn graph,
the membership of an arbitrary node cannot be efficiently answered by this
representation. However, for the purpose of many applications (e.g. assembly),
these membership queries are not needed.
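To make the notion of self-information concrete, the following Python sketch computes the number of bits needed to identify one set of n distinct k-mers among all possible such sets, i.e. log2 of the binomial coefficient C(4^k, n). The function name and the use of `math.lgamma` are choices made for this illustration only; the toy values (k = 3, n = 7) are those of Fig. 1(d), and the human-scale values (k = 27, n = 2.4 × 10^9) are those used elsewhere in the text.

```python
import math

def self_information_bits(k: int, n: int) -> float:
    """Bits needed to single out one set of n distinct k-mers: log2 C(4**k, n)."""
    total = 4 ** k  # number of possible k-mers over {A, C, G, T}
    # ln C(total, n) computed through log-gamma to avoid huge integers.
    ln_choices = (math.lgamma(total + 1)
                  - math.lgamma(n + 1)
                  - math.lgamma(total - n + 1))
    return ln_choices / math.log(2)

# Toy case of Fig. 1(d): 7 nodes, k = 3.
toy_bits = math.ceil(self_information_bits(3, 7))

# Human-genome-scale case used in the text: n = 2.4e9 nodes, k = 27.
n_human = 2_400_000_000
human_bits_per_node = self_information_bits(27, n_human) / n_human
```

Under these assumptions the per-node value for the human-scale case lies well above 13.2 bits, which is the sense in which the structure is below the self-information of the nodes.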
We apply this representation to perform de novo assembly by traversing the
graph. In our context, a traversal is any algorithm that visits all the
nodes of the graph exactly once (e.g. a depth-first search algorithm). Thus, a
mechanism is needed to mark which nodes have already been visited. However,
nodes of a probabilistic de Bruijn graph cannot store additional information. We
show that recording only the visited complex nodes (those with in-degree or out-degree different from one) is a space-efficient solution. The combination of (i) the
probabilistic de Bruijn graph along with the set of critical false positives, and
(ii) the marking scheme, enables very low-memory de novo assembly.
In the first Section, the notions of de Bruijn graphs and Bloom filters are
formally defined. Section 3 describes our scheme for exactly encoding the de
Bruijn graph using a Bloom filter. Section 4 presents a solution for traversing
our representation of the de Bruijn graph. Section 6 presents two experimental
results: (i) an evaluation of the usefulness of removing false positives and (ii) an
assembly of a real human dataset using an implementation of the structure. A
comparison is made with other recent assemblers based on de Bruijn graphs.
The de Bruijn graph [5], for a set of strings S, is a directed graph. For simplicity, we adopt a node-centric definition. The nodes are all the k-length substrings
(also called k-mers) of each string in S. An edge s1 → s2 is present if the
(k − 1)-length suffix of s1 is also a prefix of s2. Throughout this article, we will
refer interchangeably to a node and its k-mer sequence as the same object.
A more popular, edge-centric definition of de Bruijn graphs requires that
edges reflect consecutive nodes. For k′-mer nodes, an edge s1 → s2 is present if
there exists a (k′ + 1)-mer in a string of S containing s1 as a prefix and s2 as
a suffix. The node-centric and edge-centric definitions are essentially equivalent
when k′ = k − 1 (although in the former, nodes have length k, and k − 1 in the
latter).
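The node-centric definition can be sketched in a few lines of Python. This is an illustrative in-memory construction only (the function names are hypothetical, and no attempt is made at space efficiency): nodes are the k-mers of the input strings, and an edge is added whenever the (k − 1)-suffix of one node equals the (k − 1)-prefix of another.

```python
from collections import defaultdict

def kmers(s, k):
    """All k-length substrings of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def de_bruijn_node_centric(strings, k):
    """Return (nodes, edges) of the node-centric de Bruijn graph of `strings`."""
    nodes = set()
    for s in strings:
        nodes.update(kmers(s, k))
    # Index nodes by their (k-1)-length prefix to find successors quickly.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)
    # Edge u -> v whenever the (k-1)-suffix of u is the (k-1)-prefix of v.
    edges = {(u, v) for u in nodes for v in by_prefix.get(u[1:], ())}
    return nodes, edges

nodes, edges = de_bruijn_node_centric(["ACGTC"], 3)
```

Note that, with this definition, an edge may join two k-mers even if the corresponding (k + 1)-mer never occurs in S; this is the difference with the edge-centric definition.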
The Bloom filter [8] is a space-efficient hash-based data structure, designed
to test whether an element is in a set. It consists of a bit array of m bits,
initialized with zeros, and h hash functions. To insert or test the membership
of an element, h hash values are computed, yielding h array positions. The
insert operation corresponds to setting all these positions to 1. The membership
operation returns yes if and only if all of the bits at these positions are 1. A no
answer means the element is definitely not in the set. A yes answer indicates that
the element may or may not be in the set. Hence, the Bloom filter has one-sided
errors. The probability of false positives increases with the number of elements
inserted in the Bloom filter. When considering hash functions that yield equally
likely positions in the bit array, and for large enough array size m and number
of inserted elements n, the false positive rate F is [8]:
$$F \approx \left(1 - e^{-hn/m}\right)^{h} = \left(1 - e^{-h/r}\right)^{h} \qquad (1)$$
where r = m/n is the number of bits per element. For a fixed ratio r, minimizing
Equation 1 yields the optimal number of hash functions h ≈ 0.7r, for which F is
approximately 0.6185^r.
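The insert and membership operations described above, together with Equation 1, can be sketched as follows. This is a minimal illustration, not the filter used in any particular software: the class name, the use of salted BLAKE2b digests as the h hash functions, and the byte-array layout are all assumptions of this example.

```python
import hashlib
import math

class BloomFilter:
    """Bit array of m bits with h hash functions (illustrative sketch)."""

    def __init__(self, m: int, h: int):
        self.m = m
        self.h = h
        self.bits = bytearray((m + 7) // 8)  # all bits initialised to zero

    def _positions(self, item: str):
        # h array positions, one per salted hash of the item.
        for i in range(self.h):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        # "yes" only if every position is set; a "no" is always correct.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

def false_positive_rate(h: int, r: float) -> float:
    """Equation 1: F ~ (1 - exp(-h / r)) ** h, with r = m / n bits per element."""
    return (1.0 - math.exp(-h / r)) ** h

bf = BloomFilter(m=1000, h=7)
for kmer in ("ACG", "CGT", "GTC"):
    bf.add(kmer)
```

For example, with r = 10 bits per element and h = 7 hash functions, `false_positive_rate(7, 10)` evaluates to just under 1%.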
3
3.1
The set cFP grows with the number of false positives. To optimize memory
usage, a trade-off between the sizes of the Bloom filter and cFP is studied here.
[Figure 1 panels: (a) example graph over 3-mers; (b) toy hash function applied to a sample of nodes; (c) Bloom filter bit array of 10 bits (1 0 0 0 0 1 1 0 0 0); (d) nodes self-information: ⌈log2 C(4^3, 7)⌉ = 30 bits; structure size: 10 bits (Bloom filter) + 3 · 6 bits (false positives) = 28 bits.]
Fig. 1: A complete example of removing false positives in the probabilistic de Bruijn
graph. (a) shows S, an example de Bruijn graph (the 7 non-dashed nodes), and B, its
probabilistic representation from a Bloom filter (taking the union of all nodes). Dashed
rectangular nodes (in red in the electronic version) are immediate neighbors of S in
B. These nodes are the critical false positives. Dashed circular nodes (in green) are all
the other nodes of B; (b) shows a sample of the hash values associated with the nodes
of S (a toy hash function is used); (c) shows the complete Bloom filter associated with
S; incidentally, the nodes of B are exactly those to which the Bloom filter answers
positively; (d) describes the lower bound for exactly encoding the nodes of S (self-information) and the space required to encode our structure (Bloom filter, 10 bits, and
3 critical false positives, 6 bits per 3-mer).
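The idea illustrated in Fig. 1 can be sketched in Python as follows. The sketch assumes, for simplicity, that the set of true nodes fits in memory during construction and that the Bloom filter is available as a membership predicate; the function names and the deliberately lossy 10-bit toy filter are hypothetical and are not the ones of the figure.

```python
ALPHABET = "ACGT"

def extensions(kmer):
    """The eight possible neighbours of a k-mer: 4 successors and 4 predecessors."""
    successors = [kmer[1:] + c for c in ALPHABET]
    predecessors = [c + kmer[:-1] for c in ALPHABET]
    return successors + predecessors

def critical_false_positives(true_nodes, bloom_contains):
    """Extensions of true nodes that the filter accepts but that are not true nodes."""
    cfp = set()
    for node in true_nodes:
        for ext in extensions(node):
            if bloom_contains(ext) and ext not in true_nodes:
                cfp.add(ext)
    return cfp

def exact_neighbours(node, bloom_contains, cfp):
    """Neighbours of a true node: accepted by the filter and not listed in cFP."""
    return [e for e in extensions(node) if bloom_contains(e) and e not in cfp]

def make_toy_filter(true_nodes, m=10):
    """A lossy one-hash filter over m slots, used only for this demonstration."""
    def slot(kmer):
        return sum((i + 1) * ALPHABET.index(c) for i, c in enumerate(kmer)) % m
    bits = {slot(x) for x in true_nodes}
    return lambda kmer: slot(kmer) in bits

S = {"ACG", "CGT", "GTC"}
bloom = make_toy_filter(S)
cFP = critical_false_positives(S, bloom)
```

Once cFP is built, the set S is no longer needed: starting from a true node, `exact_neighbours` only ever returns true nodes, which is what makes the traversal exact.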
Using the same notations as in the definition of the Bloom filter, given that
n = |S|, the size of the filter m and the false positive rate F are related through
Equation 1. The expected size of cFP is 8nF, since each node only has eight
possible extensions, which might be false positives. In the encoding of cFP,
each k-mer occupies 2k bits. Recall that for a given false positive rate F, the
expected optimal Bloom filter size is 1.44 n log2(1/F). The total structure size is
thus expected to be
$$\underbrace{1.44\, n \log_2\left(\frac{1}{F}\right)}_{\text{Bloom filter}} + \underbrace{16\, F n k}_{cFP} \ \text{bits} \qquad (2)$$
The size is minimal for F ≈ (16k/2.08)^{-1}. Thus, the minimal number of bits
required to store the Bloom filter and the set cFP is approximately
$$n \left(1.44 \log_2\left(\frac{16k}{2.08}\right) + 2.08\right). \qquad (3)$$
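The optimum can be recovered by differentiating Equation 2 with respect to F. In the short derivation below, T(F) and F* are names introduced here for the total size and the optimal false positive rate; the constant 2.08 arises as 1.44 / ln 2.

```latex
\begin{align*}
T(F) &= 1.44\, n \log_2\frac{1}{F} + 16\, F n k, \\
\frac{dT}{dF} &= -\frac{1.44\, n}{F \ln 2} + 16\, n k = 0
  \;\Longrightarrow\;
  F^{*} = \frac{1.44}{16\, k \ln 2} \approx \frac{2.08}{16k}, \\
T(F^{*}) &\approx n \left( 1.44 \log_2\frac{16k}{2.08} + 2.08 \right).
\end{align*}
```

The second derivative, 1.44 n / (F² ln 2), is positive, so F* is indeed a minimum. At the optimum the cFP term equals 16 F* n k ≈ 2.08 n bits.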
For illustration, Figure 2-(a) shows the size of the structure for various Bloom
filter sizes and k = 27. For this value of k, the optimal size of the Bloom filter
is 11.1 bits per k-mer, and the total structure occupies 13.2 bits per k-mer.
Figure 2-(b) shows that k has only a modest influence on the optimal structure
size. Note that the size of the cFP structure is in fact independent of k.
In comparison, a Bloom filter with virtually no critical false positives would
require 8nF < 1, i.e. r > 1.44 log2(8n). For a human genome (n = 2.4 × 10^9), r
would be greater than 49.2, yielding a Bloom filter of size 13.7 GB.
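The figures quoted in this section can be checked numerically. The sketch below evaluates the optimal sizes for k = 27 and the Bloom-filter-only alternative for n = 2.4 × 10^9; the function name is hypothetical, and the conversion assumes that GB denotes 2^30 bytes, which is the interpretation under which the reported sizes are reproduced.

```python
import math

def optimal_structure_bits_per_kmer(k: int):
    """Bloom filter, cFP and total sizes (bits per k-mer) at the optimal F."""
    f_opt = 2.08 / (16 * k)               # optimal false positive rate
    bloom = 1.44 * math.log2(1 / f_opt)   # Bloom filter term of Equation 2, per k-mer
    cfp = 16 * f_opt * k                  # cFP term of Equation 2, per k-mer
    return bloom, cfp, bloom + cfp

bloom_bits, cfp_bits, total_bits = optimal_structure_bits_per_kmer(27)

n = 2.4e9                     # number of nodes for the human genome example
BITS_PER_GB = 8 * 2 ** 30     # assumption: 1 GB = 2**30 bytes

total_gb = total_bits * n / BITS_PER_GB

# Bloom filter alone, with fewer than one expected critical false positive.
r_no_cfp = 1.44 * math.log2(8 * n)
bloom_only_gb = r_no_cfp * n / BITS_PER_GB
```

Here `cfp_bits` equals 2.08 whatever the value of k, which is why the size of the cFP structure does not depend on k at the optimum.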
Many NGS applications, e.g. de novo assembly of genomes [11] and transcriptomes [4], and de novo variant detection [17], rely on (i) simplifying and (ii)
traversing the de Bruijn graph. However, the graph as represented in the previous section neither supports (i) simplifications (as it is immutable) nor (ii)
traversals (as the Bloom filter cannot store an additional visited bit per node).
To address the former issue, we argue that the simplification step can be avoided
by designing a slightly more complex traversal procedure [2].
We introduce a novel, lightweight mechanism to record which portions of
the graph have already been visited. The idea behind this mechanism is that
not every node needs to be marked. Specifically, nodes that are inside simple
paths (i.e. nodes having an in-degree of 1 and an out-degree of 1) will either
be all marked or all unmarked. We will refer to nodes having their in-degree
or out-degree different from 1 as complex nodes. We propose to store marking
information of complex nodes, by explicitly storing complex nodes in a separate
hash table. In de Bruijn graphs of genomes, the complete set of nodes dwarfs the
set of complex nodes; however, the ratio depends on the genome complexity [7].
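The marking idea can be illustrated with a small Python sketch. It is a simplified directed-graph traversal written for this example, not the procedure implemented in Minia: the graph is given as hypothetical adjacency dictionaries `succ` and `pred`, the search starts from a complex node, and only complex nodes are ever stored in the marking set.

```python
def traverse(seed, succ, pred):
    """Visit each simple path reachable from the complex node `seed` exactly once.

    succ and pred map each node to its list of successors / predecessors.
    Returns the list of (start, simple_path, end) triples and the marked set.
    """
    def is_complex(v):
        return len(succ[v]) != 1 or len(pred[v]) != 1

    marked = {seed}          # marking information: complex nodes only
    stack = [seed]
    paths = []
    while stack:
        u = stack.pop()
        for v in succ[u]:
            path = []
            while not is_complex(v):   # simple nodes need no mark of their own
                path.append(v)
                v = succ[v][0]
            paths.append((u, path, v))
            if v not in marked:        # first arrival at this complex node
                marked.add(v)
                stack.append(v)
    return paths, marked

# Toy graph: S branches into two simple paths that merge at T, then T -> d -> e.
succ = {"S": ["a", "c"], "a": ["b"], "b": ["T"], "c": ["T"],
        "T": ["d"], "d": ["e"], "e": []}
pred = {v: [] for v in succ}
for u, outs in succ.items():
    for v in outs:
        pred[v].append(u)

paths, marked = traverse("S", succ, pred)
```

In this toy graph the four simple nodes are each visited once although only the three complex nodes (S, T and e) are stored. A cycle made only of simple nodes would not be reached by this sketch.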
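[Figure 2 panels: (a) structure size curve, with the optimum marked at 11.1 bits per k-mer for the Bloom filter and 13.2 bits per k-mer in total; (b) y-axis: optimal size; x-axis: k-mer size (20 to 100).]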
Fig. 2: (a) Structure size (Bloom filter, critical false positives) as a function of the number
of bits per k-mer allocated to the Bloom filter (also called ratio r) for k = 27. The
trade-off that optimizes the total size is shown in dashed lines. (b) Optimal size of the
structure for different values of k.
Implementation
The de Bruijn graph structure described in this article was implemented in a new
de novo assembly software: Minia. An important preliminary step is to retrieve
the list of distinct k-mers that appear in the reads, i.e. true graph nodes. To
discard likely sequencing errors, only the k-mers which appear at least d times
are kept (solid k-mers). We experimentally set d to 3. Classical methods that
retrieve solid k-mers are based on hash tables [10], and their memory usage
scales linearly with the number of distinct k-mers. To avoid using more memory
than the whole structure, we implemented a constant-memory k-mer counting
procedure (manuscript in preparation). To deal with reverse-complementation,
k-mers are identified with their reverse-complements.
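The notions of solid k-mers and of identifying a k-mer with its reverse-complement can be illustrated as follows. This Python sketch uses the classical hash-table approach (a `Counter`), so its memory grows with the number of distinct k-mers; it is not the constant-memory procedure mentioned above. Choosing the lexicographically smaller of the two strands as representative is an assumption of this example, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    return kmer.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Single representative for a k-mer and its reverse-complement."""
    return min(kmer, reverse_complement(kmer))

def solid_kmers(reads, k: int, d: int = 3):
    """Canonical k-mers seen at least d times across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return {kmer for kmer, c in counts.items() if c >= d}

reads = ["ACGTA", "ACGTA", "ACGTA", "TTTT"]
solid = solid_kmers(reads, k=3, d=3)
```

In this toy input, ACG and CGT are counted together because they are reverse-complements of each other, while the k-mer TTT (canonical form AAA) is seen only twice and is discarded.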
We implemented in Minia a graph traversal algorithm that constructs a set
of contigs (gap-less sequences). The Bloom filter and the cFP structure are used
to determine neighbors of each node. The marking structure records already
traversed nodes. A bounded-depth, bounded-breadth BFS algorithm (following
Property 2 in [2]) is performed to traverse short, locally complex regions. Specifically, the traversal ignores tips (dead-end paths) shorter than 2k + 1 nodes. It
chooses a single path (consistently but arbitrarily), among all possible paths that
traverse graph regions of breadth ≤ 20, provided these regions end with a single
node of depth ≤ 500. These regions are assumed to be sequencing errors, short
variants or short repetitions of length ≤ 500 bp. The breadth limit prevents
combinatorial blowup. Note that paired-end read information is not taken into
account in this traversal. In a typical assembly pipeline (e.g. [18]), a separate
program (scaffolder) can be used to link contigs using pairing information.
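As an illustration of the tip criterion only, the following Python sketch tests whether a branch dead-ends before reaching a given number of nodes. It is a simplification written for this example and not Minia's traversal: the graph is a hypothetical successor dictionary, merging paths are not examined, and the bounded-breadth handling of complex regions is not reproduced.

```python
def is_short_tip(first, succ, max_nodes):
    """True if the path starting at `first` dead-ends with fewer than max_nodes nodes.

    succ maps each node to its list of successors.
    """
    v, length = first, 1
    while length < max_nodes:
        nxt = succ[v]
        if not nxt:          # dead end reached: this is a tip
            return True
        if len(nxt) > 1:     # the path branches again: not a simple tip
            return False
        v = nxt[0]
        length += 1
    return False             # at least max_nodes nodes long: kept

# For k-mers of length k, the threshold used in the text is 2k + 1 nodes.
k = 1  # deliberately tiny so that the toy paths below are near the threshold
succ = {"t1": ["t2"], "t2": [],                  # dead-end path of 2 nodes
        "u1": ["u2"], "u2": ["u3"], "u3": []}    # dead-end path of 3 nodes
short = is_short_tip("t1", succ, 2 * k + 1)
kept = is_short_tip("u1", succ, 2 * k + 1)
```

With a threshold of 3 nodes, the two-node dead end is reported as a tip while the three-node one is not.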
Results
Throughout the Results section, we will refer to the N50 metric of an assembly
as the longest contig size, such that half the assembly is contained in contigs
longer than this size.
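The N50 definition can be written as a short Python function. The sketch below uses the common convention that contigs of length at least N50 must cover at least half of the assembly; treating "longer than" as "at least as long as" is an assumption of this example.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L cover half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")

example = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

In this example the total length is 54, and the three longest contigs (10, 9 and 8) already sum to 27, so the N50 is 8.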
6.1
To test whether the combination of the Bloom filter and the cFP structure offers an advantage over a plain probabilistic de Bruijn graph, we compared both
structures in terms of memory usage and assembly consistency. We retrieved 20
million E. coli short reads from the Short Read Archive (SRX000429), and discarded pairing information. Using this dataset, we constructed the probabilistic
de Bruijn graph, the cFP structure, and marking structure, for various Bloom
filter sizes (ranging from 5 to 19 bits per k-mer) and k = 23 (yielding 4.7 M
solid k-mers).
We measured the memory usage of both structures. For each, we performed
an assembly using Minia with exactly the same traversal procedure. The assemblies were compared to a reference assembly (using MUMmer), made with
an exact graph. The percentage of nucleotides in contigs which aligned to the
reference assembly was recorded.
Figure 3 shows that both the probabilistic de Bruijn graph and our structure
have the same optimal Bloom filter size (11 bits per k-mer, total structure size
of 13.82 and 13.62 bits per k-mer respectively). In the case of the probabilistic
de Bruijn graph, the marking structure is prominent. This is because the graph
has a significant number of complex k-mers, most of which are linked to false
positive nodes. For the graph equipped with the cFP structure, the marking
structure only records the actual complex nodes; it consistently occupies 0.49
bits per k-mer. Both structures have comparable memory usage.
However, Figure 3 shows that the probabilistic de Bruijn graph produces
assemblies which strongly depend on the Bloom filter size. Even for large sizes,
the probabilistic graph assemblies differ by more than 3 Kbp from the reference
assembly. We observed that the majority of these differences were due to missing regions in the probabilistic graph assemblies. This is likely caused by extra
branching, which shortens the lengths of some contigs (contigs shorter than 100
bp are discarded).
Below 9 bits per k-mer, probabilistic graph assemblies significantly deteriorate. This is consistent with another article [12], which observed that when
the false positive rate is over 18% (i.e., the Bloom filter occupies 4 bits per
k-mer), distant nodes in the original graph become connected in the probabilistic
de Bruijn graph. To sum up, assemblies produced by the probabilistic de Bruijn
graph are prone to randomness, while those produced by our structure are exact.
[Figure 3 panels: top, whole structure size (bits/kmer), split into marking structure and Bloom filter; bottom, differences with exact assembly (Kbp); x-axis in all panels: bits per k-mer allocated to the Bloom filter.]
Fig. 3: Whole structure size (Bloom filter, marking structure, and cFP if applicable)
of the probabilistic de Bruijn graph with (top right) and without the cFP structure
(top left), for an actual dataset (E. coli, k = 23). All plots are shown as a function of the number
of bits per k-mer allocated to the Bloom filter. Additionally, the difference is shown
(bottom left and bottom right) between a reference assembly made using an exact de
Bruijn graph, and an assembly made with each structure.
6.2 de novo assembly
The N50 metric of our assembly (1.2 Kbp) is slightly above that of the other
assemblies (the second highest being SOAPdenovo, 0.9 Kbp). All the programs except one
assembled 2.1 Gbp of sequences.
We furthermore assessed the accuracy of our assembly by aligning the contigs
produced by Minia to the GRCh37 human reference using GASSST [16]. Out
of the 2,090,828,207 nucleotides assembled, 1,978,520,767 nucleotides (94.6%)
were contained in contigs having a full-length alignment to the reference, with
at least 98% sequence identity. For comparison, 94.2% of the contigs assembled
by ABySS aligned full-length to the reference with 95% identity [18].
To test another recent assembler, SparseAssembler [20], the authors assembled another dataset (NA12878), using much larger effective k values. SparseAssembler stores an approximation of the de Bruijn graph, which can be compared to
a classical graph for k′ = k + g, where g is the sparseness factor. The reported
assembly of the NA12878 individual by SparseAssembler (k + g = 56) has an
N50 value of 2.1 Kbp and was assembled using 26 GB of memory, in a day.
As an attempt to perform a fair comparison, we increased the value of k from
27 to 51 for the assembly done in Table 2 (k = 56 showed worse contiguity).
The N50 obtained by Minia (2.0 Kbp) was computed with respect to the size of
the SparseAssembler assembly. Minia assembled this dataset using 6.1 GB of memory in 27 h, a 4.2× memory improvement compared to SparseAssembler.
Step                               Time (h)   Memory (GB)
k-mer counting                     11.1
Enumerating positive extensions    2.8
Constructing cFP                   2.9
Assembly                           6.4
Overall                            23.2       5.7
Table 1: Details of steps implemented in Minia, with wall-clock time and memory
usage for the human genome assembly. For constant-memory steps, memory usage was
automatically set to an estimation of the final memory size. In all steps, only one CPU
core was used.
Discussion
Method                       Minia    C. & B.   ABySS     SOAPdenovo
Value of k chosen            27       27        27        25
[row label missing]          3.49     7.69      4.35
[row label missing]          18.6     22.0      15.9
Contig N50 (bp)              1156     250       870       886
Assembled size (Gbp)         2.09     1.72      2.10      2.08
Nb of nodes/cores            1/1      1/8       21/168    1/16
Time (wall-clock, h)         23       50        15        33
Memory (sum of nodes, GB)    5.7      32        336       140
To the best of our knowledge, Minia is the first method that can create contigs
for a complete human genome on a desktop computer. Our method improves
the memory usage of de Bruijn graphs by two orders of magnitude compared
to ABySS and SOAPdenovo, and by roughly one order of magnitude compared
to succinct and sparse de Bruijn graph constructions. Furthermore, the current
implementation completes the assembly in 1 day using a single thread.
De Bruijn graphs have more NGS applications than just de novo assembly. We plan to port our structure to replace the more expensive graph representations in two pipelines for reference-free alternative splicing detection, and
SNP detection [14,17]. We wish to highlight three directions for improvement.
First, some steps of Minia could be implemented in parallel, e.g. graph traversal.
Second, a more succinct structure can be used to mark complex k-mers. Two
candidates are Bloomier filters [1] and minimal perfect hashing.
Third, the set of critical false positives could be reduced, by exploiting the
nature of the traversal algorithm used in Minia. The traversal ignores short tips,
and in general, graph regions that are eventually unconnected. One could then
define n-th order critical false positives (n-cFP) as follows. An extension of a
true positive graph node is an n-cFP if and only if a breadth-first search from
the true positive node, in the direction of the extension, has at least one node
of depth n + 1. In other words, false positive neighbors of the original graph
which are part of tips, and generally local dead-end graph structures, will not be
flagged as critical false positives. This is an extension of the method presented in
this article which, in this notation, only detects 0-th order critical false positives.
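One possible reading of this definition is sketched below in Python. It is only an interpretation made for illustration: the true node is taken to be at depth 0 and its extension at depth 1, the breadth-first search follows every k-mer accepted by the Bloom filter (given here as a membership predicate), and the set of true nodes is assumed to be available. The function name is hypothetical.

```python
def is_nth_order_cfp(extension, true_nodes, bloom_contains, n, forward=True):
    """True if `extension` is a false positive whose BFS reaches depth n + 1."""
    if extension in true_nodes or not bloom_contains(extension):
        return False                 # a true node, or not even a false positive
    visited = {extension}
    frontier = {extension}           # nodes at depth 1
    for _ in range(n):               # grow the search up to depth n + 1
        nxt = set()
        for kmer in frontier:
            for c in "ACGT":
                cand = kmer[1:] + c if forward else c + kmer[:-1]
                if cand not in visited and bloom_contains(cand):
                    visited.add(cand)
                    nxt.add(cand)
        if not nxt:                  # the region dead-ends before depth n + 1
            return False
        frontier = nxt
    return True

# Toy filter that accepts three 3-mers, of which only ACG is a true node.
positives = {"ACG", "CGT", "GTA"}
true_nodes = {"ACG"}
bloom = positives.__contains__

order0 = is_nth_order_cfp("CGT", true_nodes, bloom, 0)
order1 = is_nth_order_cfp("CGT", true_nodes, bloom, 1)
order2 = is_nth_order_cfp("CGT", true_nodes, bloom, 2)
```

In this toy case the false positive CGT is followed by a single further false positive and then dead-ends, so it qualifies at orders 0 and 1 but not at order 2; with n = 2 it would therefore not need to be stored.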
Acknowledgments
The authors are grateful to Dominique Lavenier for helpful discussions and advice, and Aurelien Rizk for proof-reading the manuscript. This work benefited
from the ANR grant associated with the MAPPI project (2010-2014).
References
1. Chazelle, B., Kilian, J., Rubinfeld, R., Tal, A.: The bloomier filter: an efficient data
structure for static support lookup tables. In: Proceedings of the fifteenth annual
ACM-SIAM symposium on Discrete algorithms. pp. 30–39. SIAM (2004)
2. Chikhi, R., Lavenier, D.: Localized genome assembly from reads to scaffolds: practical traversal of the paired string graph. Algorithms in Bioinformatics pp. 39–48
(2011)
3. Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large
genomes. Bioinformatics 27(4), 479 (2011)
4. Grabherr, M.G., et al.: Full-length transcriptome assembly from RNA-Seq data
without a reference genome. Nat Biotech 29(7), 644–652 (Jul 2011)
5. Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. Journal of Computational Biology 2(2), 291–306 (1995)
6. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novo assembly and
genotyping of variants using colored de bruijn graphs. Nature Genetics (2012)
7. Kingsford, C., Schatz, M.C., Pop, M.: Assembly complexity of prokaryotic genomes
using short reads. BMC bioinformatics 11(1), 21 (2010)
8. Kirsch, A., Mitzenmacher, M.: Less hashing, same performance: Building a better
bloom filter. Algorithms–ESA 2006 pp. 456–467 (2006)
9. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G.,
Kristiansen, K.: De novo assembly of human genomes with massively parallel short
read sequencing. Genome research 20(2), 265 (2010)
10. Marais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting
of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
11. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010)
12. Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J.M., Brown, C.T.:
Scaling metagenome sequence assembly with probabilistic de bruijn graphs. Arxiv
preprint arXiv:1112.4193 (2011)
13. Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 27(13), i94–i101 (2011)
14. Peterlongo, P., Schnel, N., Pisanti, N., Sagot, M.F., Lacroix, V.: Identifying SNPs
without a reference genome by comparing raw reads. In: String Processing and
Information Retrieval. pp. 147–158. Springer (2010)
15. Peterlongo, P., Chikhi, R.: Mapsembler, targeted and micro assembly of large NGS
datasets on a desktop computer. BMC Bioinformatics (1), 48 (2012)
16. Rizk, G., Lavenier, D.: GASSST: global alignment short sequence search tool.
Bioinformatics 26(20), 2534 (2010)
17. Sacomoto, G., Kielbassa, J., Chikhi, R., Uricaru, R., Antoniou, P., Sagot, M.,
Peterlongo, P., Lacroix, V.: KISSPLICE: de-novo calling alternative splicing events
from RNA-seq data. BMC Bioinformatics 13(Suppl 6), S5 (2012)
18. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J.M., Birol, İ.:
ABySS: a parallel assembler for short read sequence data. Genome Research 19(6),
1117–1123 (2009)
19. Warren, R.L., Holt, R.A.: Targeted assembly of short sequence reads. PloS one
6(5), e19816 (2011)
20. Ye, C., Ma, Z., Cannon, C., Pop, M., Yu, D.: Exploiting sparseness in de novo
genome assembly. BMC Bioinformatics 13(Suppl 6), S1 (2012)