Journal of Discrete Algorithms 34 (2015) 54–61

Sergio De Agostino

http://dx.doi.org/10.1016/j.jda.2015.05.001

Article history: Available online 27 May 2015

Keywords: Distributed algorithms; Scalability; Robustness; Static lossless compression

Abstract. The greedy approach to dictionary-based static text compression can be executed by a finite-state machine. When it is applied in parallel to different blocks of data independently, there is no lack of robustness even on standard large scale distributed systems with input files of arbitrary size. Beyond standard large scale, a negative effect on the compression effectiveness is caused by the very small size of the data blocks. A robust approach for extreme distributed systems is presented in this paper, where this problem is fixed by overlapping adjacent blocks and preprocessing the neighborhoods of the boundaries.

© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Static data compression implies the knowledge of the input type. With text, dictionary-based techniques are particularly
efficient and employ string factorization. The dictionary comprises typical factors plus the alphabet characters in order to
guarantee feasible factorizations for every string. Factors in the input string are substituted by pointers to dictionary copies
and such pointers could be either variable or fixed length codewords. The optimal factorization is the one providing the best
compression, that is, the one minimizing the sum of the codeword lengths. Efficient sequential algorithms for computing
optimal solutions were provided by means of dynamic programming techniques [32] or by reducing the problem to the one
of finding a shortest path in a directed acyclic graph [29]. From the point of view of sequential computing, such algorithms
have the limitation of using an off-line approach. However, decompression is still on-line and a very fast and simple real
time decoder outputs the original string with no loss of information. Therefore, optimal solutions are practically acceptable
for read-only memory files where compression is executed only once. Alternatively, simpler versions of dictionary-based static
techniques were proposed which achieve nearly optimal compression in practice (that is, less than ten percent loss). An
important simplification is to use a fixed-length code for the pointers, so that the optimal decodable compression for this
coding scheme is obtained by minimizing the number of factors. Such a variable-to-fixed length approach is robust since
the dictionary factors are typical patterns of the input specifically considered. The problem of minimizing the number of
factors gains a relevant computational advantage by assuming that the dictionary is prefix-closed (suffix-closed), that is, all
the prefixes (suffixes) of a dictionary element are dictionary elements [4,7,20]. The left to right greedy approach is optimal
only with suffix-closed dictionaries. An optimal factorization with prefix-closed dictionaries can be computed on-line by
using a semi-greedy procedure [7,20]. On the other hand, prefix-closed dictionaries are easier to build by standard adaptive
heuristics [2,31]. These heuristics are based on an “incremental” string factorization procedure [24,34]. The most popular
for prefix-closed dictionaries is the one presented in [33]. However, the prefix and suffix properties force the dictionary to
include many useless elements which increase the pointer size and slightly reduce the compression effectiveness. A more
natural dictionary with no prefix and no suffix property is the one built by the heuristic in [27] or by means of separator
characters such as space, new line and punctuation characters in natural language.
Theoretical work was done, mostly in the nineties, to design efficient parallel algorithms on a parallel random access
machine (PRAM) for dictionary-based static text compression [3,8–10,16,21,22,28,30]. Although the PRAM model is out of
fashion today, shared memory parallel machines offer a good computational model for a first approach to parallelization.
When we address the practical goal of designing distributed algorithms we have to consider two types of complexity, the
interprocessor communication and the input–output mechanism. While the input/output issue is inherent to any parallel
algorithm and has standard solutions, the communication cost of the computational phase after the distribution of the data
among the processors and before the output of the final result is obviously algorithm-dependent. So, we need to limit the
interprocessor communication and involve more local computation to design a practical algorithm. The simplest model for
this phase is, of course, a simple array of processors with no interconnections and, therefore, no communication cost. Parallel
decompression is, obviously, possible on this model [10]. With parallel compression, the main issue is the one concerning
scalability and robustness. Traditionally, the scale of a system is considered large when the number of nodes has the order
of magnitude of a thousand. Modern distributed systems may consist of hundreds of thousands of nodes, pushing
scalability well beyond traditional scenarios (extreme distributed systems).
In [1] an approximation scheme of optimal compression with static prefix-closed dictionaries was presented for massively
parallel architectures, using no interprocessor communication during the computational phase since it is applied in parallel
to different blocks of data independently. The scheme is algorithmically related to the semi-greedy approach previously
mentioned and implementable on extreme distributed systems because adjacent blocks overlap and the neighborhoods of
the boundaries are preprocessed. However, with standard large scale the overlapping of the blocks and the preprocessing of
the boundaries are not necessary to achieve nearly optimal compression in practice. Furthermore, the greedy approach to
dictionary-based static text compression is nearly optimal on realistic data for any kind of dictionary even if the theoretical
worst-case analysis shows that the multiplicative approximation factor with respect to optimal compression achieves the
maximum length of a dictionary element [31]. If the dictionary is well-constructed by relaxing the prefix property, the loss
of greedy compression can go down to one percent with respect to the optimal one. In this paper, we relax the prefix
property of the dictionary and present two implementations of the greedy approach to static text compression with an
arbitrary dictionary on a large scale and an extreme distributed system, respectively. Moreover, we present a finite-state
machine implementation of greedy static dictionary-based compression with an arbitrary dictionary that can be relevant to
achieve high speed with standard scale distributed systems. We wish to point out that scalability cannot be guaranteed with
adaptive dictionary approaches to data compression, such as the sliding window method [25] or the dynamic one [34]. Indeed,
the size of the data blocks over the distributed memory of a parallel system must be at least a few hundred kilobytes
in both cases, that is, robustness is guaranteed with scalability only with very large files [3,12,13]. This is still true with
improved variants employing either fixed-length codewords [26,6] or variable-length ones [5,17–19,23].
In Section 2 we describe the different approaches to dictionary-based static text compression. The previous work on
parallel approximations of optimal compression with prefix-closed dictionaries is given in Section 3. Section 4 shows the
finite-state machine and the two implementations of the greedy approach for arbitrary dictionaries. Experiments are dis-
cussed in Section 5. Conclusions and future work are given in Section 6.
2. Dictionary-based static text compression

As mentioned in the introduction, the dictionary comprises typical factors (including the alphabet characters) associated
with fixed or variable length codewords. The optimal factorization is the one minimizing the sum of the codeword lengths
and sequential algorithms for computing optimal solutions were provided by means of dynamic programming techniques
[32] or by reducing the problem to the one of finding a shortest path in a directed acyclic graph [29]. When the codewords
are fixed-length, with suffix-closed dictionaries we obtain optimality by means of a simple left to right greedy approach,
that is, advancing with the on-line reading of the input string by selecting the longest matching factor with a dictionary
element. Such a procedure can be computed in real time by storing the dictionary in a trie data structure.
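For concreteness, here is a minimal Python sketch of trie-based greedy factorization (the names and representation are ours, not the paper's; it assumes every alphabet character is a dictionary element, so a factorization always exists):

    def build_trie(dictionary):
        # each node: "end" marks a dictionary element, "next" maps characters to children
        root = {"end": False, "next": {}}
        for word in dictionary:
            node = root
            for c in word:
                node = node["next"].setdefault(c, {"end": False, "next": {}})
            node["end"] = True
        return root

    def greedy_factorize(s, root):
        # left to right greedy: repeatedly take the longest matching dictionary element
        factors, i = [], 0
        while i < len(s):
            node, j, last = root, i, i + 1   # fallback: single characters are elements
            while j < len(s) and s[j] in node["next"]:
                node = node["next"][s[j]]
                j += 1
                if node["end"]:
                    last = j                 # end of the longest match found so far
            factors.append(s[i:last])
            i = last
        return factors

For instance, with the dictionary {a, b, bab, baaa} and input babaaa, greedy_factorize returns bab, a, a, a.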
If the dictionary is prefix-closed, there is an optimal semi-greedy factorization which is computed by the procedure of Fig. 1 [7,20]. At each
step, we select a factor such that the longest match starting at the next position with a dictionary element ends furthest to the right.
Since the dictionary is prefix-closed, the factorization is optimal. The algorithm can even be implemented in real time with
a modified trie data structure [20].
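Fig. 1 is not reproduced here, but the selection rule can be sketched on top of the previous trie as follows (a quadratic illustrative version; [20] shows how to obtain it in real time):

    def longest_match_end(s, i, root):
        # position right after the longest dictionary match starting at i
        node, j, last = root, i, i
        while j < len(s) and s[j] in node["next"]:
            node = node["next"][s[j]]
            j += 1
            if node["end"]:
                last = j
        return last

    def semi_greedy_factorize(s, root):
        factors, i = [], 0
        while i < len(s):
            best_end, best_reach = i + 1, -1
            node, j = root, i
            while j < len(s) and s[j] in node["next"]:
                node = node["next"][s[j]]
                j += 1
                if node["end"]:
                    # candidate factor s[i:j]: how far does the next match reach?
                    reach = longest_match_end(s, j, root)
                    if reach >= best_reach:
                        best_end, best_reach = j, reach
            factors.append(s[i:best_end])
            i = best_end
        return factors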
The semi-greedy factorization can be generalized to any dictionary by considering only those positions, among the ones
covered by the current factor, that follow a prefix which is a dictionary element [7]. The generalized semi-greedy factorization
procedure is not optimal while the greedy one is not optimal even when the dictionary is prefix-closed. The maximum
length of a dictionary element is an obvious upper bound to the multiplicative approximation factor of any string factor-
ization procedure with respect to the optimal solution. We show in the next theorems that this upper bound is tight for
the greedy and semi-greedy procedures when the dictionary is arbitrary [31] and that such tightness is kept by the greedy
procedure even if the dictionary is prefix-closed.
Theorem 2.1. There exists an infinite family of dictionaries such that the greedy and semi-greedy procedures produce Θ(m) approximations
of the optimal factorization of an input string in the worst case, where m is the maximum length of a dictionary element.
Proof. Let baba^n be the input string and let {a, b, bab, ba^n} be the dictionary, for any positive integer n. Then, the optimal
factorization is b, a, ba^n while bab, a, a, . . . , a is the factorization obtained whether the greedy or the semi-greedy
procedure is applied. Since m = n + 1, the statement of the theorem follows. □
Corollary 2.1. There exists an infinite family of prefix-closed dictionaries such that the greedy procedure produces a Θ(m) approximation
of the optimal factorization of an input string in the worst case.
Proof. Let baba^n be the input string and let {a, b, ba, bab, ba^k : 2 ≤ k ≤ n} be the prefix-closed dictionary, for any positive
integer n. Then, the optimal factorization ba, ba^n is computed by the semi-greedy approach while the greedy factorization
is bab, a, a, . . . , a as in Theorem 2.1. Again, the statement follows since m = n + 1. □
We wish to point out the artificiality of the examples proving the tightness of a large upper bound to the approximation
cost of the greedy approach with respect to optimal factorization. Indeed, in practice greedy factorizations are nearly opti-
mal. Intuitively, this is based on one observation concerning realistic data. If a greedy choice is made at a position of the
string which is "natural" for starting a new factor, such a choice produces a "good" factor and, very likely, the next greedy choice
will do the same. With sequential computing, this kind of inductive reasoning applies since the first position of the string
is always a “natural” one to start a factorization process. This is also the reason why we can relax the prefix property of the
dictionary. When discussing the distributed memory implementations in the next sections, we have to face the problem that
the input string is partitioned into blocks of fixed length and each of them is broadcasted to a different processor. Except
for the processor receiving the first block (a prefix of the input string), the other ones might start the greedy process in
“bad” positions. Overcoming this disadvantage is possible either by processing data blocks sufficiently long (standard large
scale case) or by preprocessing the boundaries between adjacent blocks (extreme case).
3. Previous work
Given an arbitrary dictionary, for every integer k greater than 1 there is an O(km) time, O(n/km) processors distributed
algorithm factorizing an n-character input string S with a cost which approximates the cost of the optimal factor-
ization within the multiplicative factor (k + m − 1)/k [3]. However, with prefix-closed dictionaries a better approximation
scheme was presented in [1], producing a factorization of S with a cost approximating the cost of the optimal factorization
within the multiplicative factor (k + 1)/k in O(km) time with O(n/km) processors. This second approach was designed for
a massively parallel architecture and is suitable for extreme distributed systems, when the scale is beyond standard large
values. On the other hand, the first approach applies to standard small, medium and large scale systems. Both approaches
provide approximation schemes for the corresponding factorization problems since the multiplicative approximation fac-
tors converge to 1 as km converges to n. Indeed, in both cases compression is applied in parallel to different blocks
of data independently. Beyond standard large scale, adjacent blocks overlap and the neighborhoods of the boundaries are
preprocessed.
To decode the compressed files on a distributed system, it is enough to use a special mark occurring in the sequence
of pointers each time the coding of a block ends. The input phase distributes the subsequences of pointers coding each
block among the processors. Since a copy of the dictionary is stored in every processor, the decoding of the blocks is
straightforward.
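For illustration only, such a decoder might be sketched as follows (hypothetical names; codewords are taken to be dictionary indices and MARK a reserved codeword closing the encoding of a block):

    from concurrent.futures import ProcessPoolExecutor

    MARK = -1                      # reserved codeword marking the end of a block

    def decode_block(pointers, dictionary):
        # every processor holds a copy of the dictionary: decoding is lookup only
        return "".join(dictionary[p] for p in pointers)

    def parallel_decode(codewords, dictionary, workers=8):
        # split the pointer sequence at the marks, then decode the blocks independently
        blocks, current = [], []
        for p in codewords:
            if p == MARK:
                blocks.append(current)
                current = []
            else:
                current.append(p)
        if current:
            blocks.append(current)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            pieces = pool.map(decode_block, blocks, [dictionary] * len(blocks))
        return "".join(pieces)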
In the following two subsections, we describe the two approaches. Then, how to speed up the preprocessing phase of
the second approach is described in the last subsection. The next section will argue that we can relax the requirement
of computing a theoretical approximation of optimal compression since, in practice, the greedy approach is nearly optimal
on data blocks sufficiently long. On the other hand, when the blocks are too short because the scale of the distributed
system is beyond standard values, the overlapping of the adjacent blocks and the preprocessing of the neighborhoods of the
boundaries are necessary to guarantee the robustness of the greedy approach. Consequently, also the prefix property of the
dictionary can be relaxed.
3.1. The approach for an arbitrary dictionary

We simply apply in parallel optimal compression to blocks of length km [3]. Every processor stores a copy of the dictionary.
For an arbitrary dictionary, we execute the dynamic programming procedure computing the optimal factorization of a
string in linear time [32] (the procedure in [29] is pseudo-linear for fixed-length coding and even super-linear for variable-length
coding). Obviously, this works for prefix- and suffix-closed dictionaries as well and, in any case, we know the semi-greedy
and greedy approach are implementable in linear time. It follows that the algorithm requires O(km) time with n/km pro-
cessors and the multiplicative approximation factor is (k + m − 1)/k with respect to any factorization. Indeed, when the
boundary cuts a factor the suffix starting the block and its substrings might not be in the dictionary. Therefore, the multi-
plicative approximation factor follows from the fact that m − 1 is the maximum length for a proper suffix as shown in Fig. 2
(sequence of plus signs in parentheses). If the dictionary is suffix-closed, the multiplicative approximation factor is (k + 1)/k
since each suffix of a factor is a factor.
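For fixed-length codewords, optimality on a block means minimizing the number of factors, and the dynamic programming recurrence is simply opt[j] = 1 + min{opt[i] : block[i:j] is a dictionary element}. A straightforward sketch, reusing the trie representation introduced in Section 2 (the procedure of [32] is more refined, running in linear time, while this version is O(km · m)):

    def optimal_factorize(block, root):
        # opt[j] = minimum number of factors covering block[:j]
        n = len(block)
        opt = [0] + [None] * n
        back = [0] * (n + 1)                 # factor boundaries for backtracking
        for i in range(n):
            if opt[i] is None:
                continue
            node, j = root, i
            while j < n and block[j] in node["next"]:
                node = node["next"][block[j]]
                j += 1
                if node["end"] and (opt[j] is None or opt[i] + 1 < opt[j]):
                    opt[j] = opt[i] + 1
                    back[j] = i
        factors, j = [], n
        while j > 0:                         # recover the factors right to left
            factors.append(block[back[j]:j])
            j = back[j]
        return factors[::-1]

Each of the n/km processors runs this independently on its own block of length km.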
The approximation scheme is suitable only for standard scale systems unless the file size is very large. In effect, the
block size must be on the order of kilobytes to guarantee robustness. Beyond standard large scale, overlapping of adjacent
blocks and a preprocessing of the boundaries are required. We will see, in the next subsection, the approximation scheme
for prefix-closed dictionaries.
3.2. The approach for prefix-closed dictionaries

With prefix-closed dictionaries a better approximation scheme was presented in [1]. During the input phase blocks of
length m(k + 2), except for the first one and the last one which are m(k + 1) long, are broadcasted to the processors. Each
block overlaps on 2m characters with the adjacent block to the left and to the right, respectively (obviously, the first one
overlaps only to the right and the last one only to the left).
We call a boundary match a factor covering positions in the first and second half of the 2m characters shared by two
adjacent blocks. The processors execute the following algorithm to compress each block:
• for each block, every corresponding processor but the one associated with the last block computes the boundary match
between its block and the next one ending furthest to the right, if any;
• each processor computes the optimal factorization from the beginning of its block to the beginning of the boundary
match on the right boundary of its block (or the end of its block if there is no boundary match).
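A sketch of the two steps might read as follows (hypothetical glue code: the boundary match search below is the quadratic brute-force version, whose O(m) replacement is the subject of the last subsection, optimal_factorize is the sketch given above, and the last block simply factorizes up to its end):

    def boundary_match_furthest(overlap, root):
        # overlap = the 2m characters shared by two adjacent blocks; a boundary
        # match starts in the first half and ends in the second half
        m = len(overlap) // 2
        best = None                          # (start, end), end exclusive
        for start in range(m):
            node, j = root, start
            while j < len(overlap) and overlap[j] in node["next"]:
                node = node["next"][overlap[j]]
                j += 1
                if node["end"] and j > m and (best is None or j > best[1]):
                    best = (start, j)        # keep the match ending furthest right
        return best

    def compress_block(block, root, two_m):
        # the block ends with the 2m characters shared with the next block
        bm = boundary_match_furthest(block[-two_m:], root)
        stop = len(block) - two_m + bm[0] if bm else len(block)
        return optimal_factorize(block[:stop], root)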
Stopping the factorization of each block at the beginning of the right boundary match might create a surplus
factor, which determines the multiplicative approximation factor (k + 1)/k with respect to any other factorization.
Indeed, as shown in Fig. 3, the factor in front of the right boundary match (sequence of x's) might be extended to be
a boundary match itself (sequence of plus signs) and to cover the first position of the factor after the boundary (dotted
line). Then, the approximation scheme produces a factorization of S with a cost approximating the cost of the optimal
factorization within the multiplicative factor (k + 1)/k in O(km) time with O(n/km) processors (we will see in the next
subsection how the preprocessing can be executed in O(m) time).
In [1], it is shown experimentally that for k = 10 the compression ratio achieved by such a factorization is about the same
as the sequential one. Considering that the average match length is typically 10, one processor can compress as few as 100
bytes independently, and this is why the approximation scheme was presented for a massively parallel architecture.
3.3. Computing the boundary matches in O(m) time

The parallel running time of the preprocessing phase computing the boundary matches is O(m^2) by brute force. To lower
the complexity to O(m), an augmented trie data structure is needed [14]. For each node v of the trie, let f be the dictionary
element corresponding to v and a an alphabet character not represented by an edge outgoing from v. Then, we add an
edge from v to w with label a, where w represents the longest proper suffix of fa in the dictionary. Each processor has
a copy of this augmented trie data structure and first preprocesses the 2m characters overlapped by the adjacent block on
the left boundary and, secondly, the ones on the right boundary. In each of these two sub-phases, the processors advance
with the reading of the 2m characters from left to right, starting from the first one, while visiting the trie starting from
the root and using the corresponding edges. A temporary variable t2 stores the position of the current character during the
preprocessing while another temporary variable t1 is, initially, equal to t2. When an added edge of the augmented structure
is visited, the value t = t2 − d + 1 is computed, where d is the depth of the node reached by such edge. If t is a position
in the first half of the 2m characters, then t1 is updated by changing its value to t. Else, the procedure stops and t2 is
decreased by 1. If t2 is a position in the second half of the 2m characters then t1 and t2 are the first and last position of a
boundary match, else there is no boundary match.
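The added edges are, in essence, suffix links computed breadth-first as in string matching automata. The following sketch builds the augmented trie and runs the scan just described on the 2m overlapped characters; it assumes, as in this section, a prefix-closed dictionary (so that every trie node is an element) and extends the trie nodes of the earlier sketches with depth, fail and added fields:

    from collections import deque

    def _goto(node, a, root):
        # full transition function: trie edge if present, added edge otherwise
        if a in node["next"]:
            return node["next"][a]
        return node["added"].get(a, root)

    def augment_trie(root, alphabet):
        root["depth"], root["added"] = 0, {}
        queue = deque()
        for a in alphabet:
            if a in root["next"]:
                child = root["next"][a]
                child["depth"], child["fail"] = 1, root
                queue.append(child)
            else:
                root["added"][a] = root      # the longest proper suffix is empty
        while queue:                         # breadth-first: parents before children
            v = queue.popleft()
            v["added"] = {}
            for a in alphabet:
                if a in v["next"]:
                    w = v["next"][a]
                    w["depth"] = v["depth"] + 1
                    w["fail"] = _goto(v["fail"], a, root)
                    queue.append(w)
                else:
                    # edge to the node of the longest proper suffix of fa in the trie
                    v["added"][a] = _goto(v["fail"], a, root)

    def boundary_match_scan(overlap, root):
        # returns the boundary match ending furthest to the right as 1-based
        # positions (t1, t2), or None if there is no boundary match
        m = len(overlap) // 2
        v, t1, t2 = root, 1, 0
        for t2, a in enumerate(overlap, start=1):
            if a in v["next"]:
                v = v["next"][a]
            else:                            # an added edge is visited
                w = v["added"][a]
                t = t2 - w["depth"] + 1      # start of the retained suffix
                if t <= m:                   # the match still starts in the first half
                    t1, v = t, w
                else:
                    t2 -= 1
                    break
        return (t1, t2) if t2 > m else None

Since each of the 2m characters is consumed once and every transition is a dictionary lookup, the scan runs in O(m) time.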
4. The greedy approach on a distributed system

We provide a finite-state machine implementation of the greedy approach with an arbitrary dictionary. Then, we show
the two implementations on standard large scale and extreme distributed systems.
4.1. The finite-state machine

We show the finite-state machine implementation producing the on-line greedy factorization of a string with an arbitrary
dictionary. The most general formulation for a finite-state machine M is to define it as a six-tuple (A, B, Q, δ, q0, F) with
an input alphabet A, an output alphabet B, a set of states Q, a transition function δ : Q × A → Q × B*, an initial state q0
and a set of accepting states F ⊆ Q. The trie storing the dictionary is a subgraph of the finite-state machine diagram. It
is well known that each dictionary element is represented as a path from the root to a node of the trie where edges are
labeled with an alphabet character (the root representing the empty string). The edges are directed from the parent to the
child and the set of nodes represents the set of states of the machine. The output alphabet is binary and the factorization
is represented by a binary string having the same length as the input string. The bits of the output string equal to 1 are
those corresponding to the positions where the factors start. Since every string can be factorized, every state is accepting.
The root represents the initial state. We need only to complete the function δ by adding the missing edges of the diagram.
The empty string is associated as output with the edges in the trie. For each node, the outgoing edges represent a subset of
the input alphabet. Let f be the string (or dictionary element) corresponding to the node v in the trie and a an alphabet
character not represented by an edge outgoing from v. Let fa = f1 · · · fk be the on-line greedy factorization of fa and i
be the smallest index such that fi+1 · · · fk is represented by a node w in the trie. Then, we add to the trie a directed edge
from v to w with label a. The output associated with the edge is the binary string representing the sequence of factors
f1 · · · fi. By adding such edges, the machine is entirely defined. Redefining the machine to produce the compressed form of
the string is straightforward.
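Materializing δ edge by edge is mechanical; the sketch below simulates the machine's input/output behavior instead, producing the binary output string with a 1 at each position where a greedy factor starts (again assuming the alphabet characters are dictionary elements):

    def greedy_output_string(s, root):
        # the bit string the machine outputs: same length as the input,
        # with a 1 exactly at the positions where the greedy factors start
        bits = ["0"] * len(s)
        i = 0
        while i < len(s):
            bits[i] = "1"
            node, j, last = root, i, i + 1
            while j < len(s) and s[j] in node["next"]:
                node = node["next"][s[j]]
                j += 1
                if node["end"]:
                    last = j
            i = last
        return "".join(bits)

For instance, with the dictionary {a, b, bab, baaa} of the earlier example and input babaaa, the output is 100111.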
4.2. The implementations on distributed systems

Since greedy factorization is nearly optimal in practice, as a first approach we simply apply in parallel left to right
greedy compression to blocks of length km. However, at the end of Section 2 we pointed out that a processor might start
the greedy process in a “bad” position. Therefore, data blocks must be sufficiently long to overcome this disadvantage. The
order of kilobytes for the block size is enough to guarantee robustness and requires a standard large scale system. Each
of the O(n/km) processors could apply the finite-state machine implementation to its block. Beyond standard large scale,
blocks are too short and a preprocessing of the boundaries is required to identify the “good” positions by overlapping
adjacent blocks. Once this is done, we can relax the requirement of computing a theoretical approximation of optimal
compression by applying the greedy process rather than the semi-greedy one. Moreover, we can relax the prefix property
of the dictionary in both approaches. During the input phase overlapping blocks of length m(k + 2) are broadcasted to the
processors as in the previous section. On the other hand, the definition of boundary match is extended to those factors
which are suffixes of the first half of the 2m characters shared by two adjacent blocks. The procedure is the following:
• for each block, every corresponding processor but the one associated with the last block computes the longest boundary
match between its block and the next one;
• each processor computes the greedy factorization from the end of the boundary match on the left boundary of its block
to the beginning of the boundary match on the right boundary.
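As a sketch (hypothetical glue code: a brute-force search stands in for the longest boundary match, whose O(m) computation closes this section, and greedy_factorize is the sketch of Section 2), a processor holding an internal block might proceed as follows; the first and the last block simply skip the left and the right step, respectively:

    def longest_boundary_match(overlap, root):
        # extended definition: a factor ending exactly at the half border
        # (a suffix of the first half) also qualifies as a boundary match
        m = len(overlap) // 2
        best = (m, m)                        # fallback: empty match at the border
        for start in range(m):
            node, j = root, start
            while j < len(overlap) and overlap[j] in node["next"]:
                node = node["next"][overlap[j]]
                j += 1
                if node["end"] and j >= m and j - start > best[1] - best[0]:
                    best = (start, j)        # keep the longest qualifying match
        return best

    def compress_block_greedy(block, root, two_m):
        # the block carries two_m overlapping characters on each boundary
        _, left_end = longest_boundary_match(block[:two_m], root)
        right_start, _ = longest_boundary_match(block[-two_m:], root)
        start = left_end                             # end of the left boundary match
        stop = len(block) - two_m + right_start      # start of the right boundary match
        return greedy_factorize(block[start:stop], root)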
The following theorem proves that the above procedures are not approximation schemes from a theoretical point of view.
Theorem 4.1. There exists an infinite family of dictionaries such that the distributed implementations of the greedy approach with or
without boundary matches produce Θ(m) approximations of the optimal factorization of an input string in the worst case.
Proof. Let (aba^{m−2})^{n/m} be the input string and let {a, b, ab, ba^{m−2}, b^m} be the dictionary, for any positive integer n.
With an O(n/km) processors distributed algorithm, each processor factorizes a block equal to (aba^{m−2})^k if there is no
overlapping. Otherwise, the block is extended on both boundaries by aba^{m−2} and every boundary match is equal to a.
In conclusion, every sub-block aba^{m−2} is factorized into ab, a, a, . . . , a by both implementations. On the other hand,
a, ba^{m−2} provides the optimal factorization of the input string and the statement of the theorem follows. □
Corollary 4.1. There exists an infinite family of prefix-closed dictionaries such that the distributed implementations of the greedy
approach with or without boundary match preprocessing produce Θ(m) approximations of the optimal factorization of an input
string in the worst case.
Proof. The same proof of Theorem 4.1 works with the prefix-closed dictionary {a, b, ab} ∪ {ba^k, b^k : 1 ≤ k ≤ m − 2} ∪ {b^{m−1}, b^m}. □
Again, these examples are artificial, as in the sequential case. In practice, the distributed implementation preprocessing
the boundary matches is nearly optimal for k = 10, as is the approximation scheme of the previous section. Therefore, the
compression ratio achieved is about the same as the sequential one with blocks of 100 bytes and the approach presented in
this section is obviously suitable for extreme distributed systems. Indeed, with a file size of several megabytes or more, the
system scale has a greater order of magnitude than the standard large scale parameter. We wish to point out that the com-
putation of the boundary matches is very relevant for the compression effectiveness when an extreme distributed system
is employed since the sub-block length becomes much less than 1K. With standard large scale systems the block length
is several kilobytes with just a few megabytes to compress and the approach using boundary matches is too conservative.
Basically, increasing the size of the block by one order of magnitude guarantees the robustness of the standard approach.
Obviously, these arguments for both approaches assume that the dictionary is well-constructed and, consequently, it does
not have to be prefix-closed. Finally, after preprocessing on an extreme distributed system, each of the O(n/km) processors
could apply the finite-state machine implementation to its block. However, blocks are so short that this becomes irrelevant.
On the other hand, with standard scale systems and very large size files the application of the finite-state machine to the
distributed blocks plays an important role to achieve high speed.
To lower the parallel running time of the preprocessing phase to O(m), the same augmented trie data structure described
in the previous section is needed but, in this case, the boundary matches are the longest ones rather than the ones ending
furthest to the right. Then, besides the temporary variables t1 and t2 employed by the preprocessing phase described in the
previous section, two more variables τ1 and τ2 are required, initially equal to t1 and t2. Each time t1 must be updated
by such preprocessing phase, the value t2 − t1 + 1 is compared with τ2 − τ1 before updating. If it is greater, or if τ2 is smaller
than the last position of the first half of the 2m characters, τ1 and τ2 are set equal to t1 and t2 − 1. Then, t1 is updated. At
the end of the procedure, τ1 and τ2 are the first and last positions of the longest boundary match. We wish to point out
that a boundary match is always computed, since the final value of τ2 always corresponds either to a position
in the second half of the 2m characters or to the last position of the first half.
5. Experimental results
Suffix-closed and prefix-closed dictionaries have been considered in static data compression because they are constructed
by the LZ77 [25] and LZ78 [34] adaptive compression methods, when reading a typical string of a given source of data.
When the input string to compress matches the characteristics of a dictionary given in advance and already filled with
typical factors, the advantage in terms of compression efficiency is obvious. However, the bounded size of the dictionary
(typically, 2^16 factors) and its static nature imply a lack of robustness, and the adaptive methods might prove more effective
in some cases, even if the type of data is known and the dictionary is very well constructed. We observed this with
the “compress” command line on the Unix and Linux platforms, which is the implementation of a variant of the LZ78
method, called the LZC method. LZC builds a prefix-closed dictionary of 2^16 factors while compressing the data. When
the dictionary is full, it applies static dictionary greedy compression monitoring at the same time the compression ratio.
When the compression ratio starts deteriorating, it clears the dictionary and restarts dynamic compression alternating, in
this way, adaptive and non-adaptive compression. We found that, when compressing megabytes of English text with
a static prefix-closed dictionary optimally, there might be up to a ten percent loss in comparison with the compression
ratio of the LZC method [1]. However, as we pointed out earlier, there is no scalable and robust implementation of the
LZC method on a distributed memory system (except for the static phase of the method as shown in [13]), while a nearly
optimal compression distributed algorithm is possible with no scalability and robustness issues if we accept a ten percent
compression loss as a reasonable upper bound to the price to pay for it [1].
A prefix-closed dictionary D in [1] was filled up with 2^16 elements, starting from the alphabet (each of the 256 bytes).
Then, for each of the most common substrings listed in [31], every prefix of length less than or equal to ten was added to D.
On the other hand, for each string with no capital letters and fewer than eleven characters in the Unix dictionary of words,
we added every prefix of length less than or equal to six. For every word in the Unix dictionary inserted in D, a space was
concatenated at the end of the copy in D. Another copy ending with the new line character was inserted if the word length
is less than six. Finally, it was enough to add a portion of the words with six characters plus a new line character to fill
up D.
The average optimal compression ratio we obtained with this dictionary is 0.51, while the greedy one is as high as 0.57. On
the other hand, the LZC average compression ratio is 0.42. It turned out that both gaps are consistently reduced when the
prefix property of the dictionary is relaxed. A non-prefix-closed dictionary D′ was filled up with 2^16 elements, starting from
the alphabet and the 477 most common substrings listed in [31]. Then, we added each string with no capital letters and
fewer than ten characters from the Unix dictionary of words. Again, for every word in the Unix dictionary inserted in D′,
a space was concatenated at the end of the copy in D′. Finally, it was enough to add a portion of short words with a
new line character at the end to fill up D′. With such a dictionary, the loss on the compression ratio goes down from ten
to five percent with respect to the adaptive LZC compression. Moreover, the greedy approach has just a one percent loss
with respect to optimal, as shown in Fig. 4. This is because the dictionary is better constructed. In Fig. 4, we also show
the compression effectiveness results for the two approaches with or without boundaries preprocessing (that is, for an
extreme or a standard distributed system). The two approaches perform similarly and have a one percent loss with respect
to sequential greedy, whether the dictionary is prefix-closed or not.
We observed in the introduction that for read-only memory files, speeding up decompression is what really matters in
practice. In this context, the results presented in this paper suggest a dynamic approach (that is, working for any type of
input), where the dictionary is not given in advance but learned from the input string and, then, used statically to compress
the string. This models a scheme where compression is performed only once with an off-line sequential procedure reading
the string twice from left to right in such a way that decompression can be parallelized with no scalability issues. The first
left-to-right reading is to learn the dictionary, and better ways than the LZC algorithm exist since the dictionary provided
by LZC, after reading the entire string, is constructed from a relatively short suffix of the input. A much more sophisticated
approach applies the LRU (least recently used) strategy [31] to a dictionary learned by the heuristic in [27]. Such a dictionary
is not prefix-closed and, with such a strategy, after the dictionary is filled up, elements are removed in a continuous way
by deleting at each step of the factorization the least recently used factor which is not a proper prefix of another one.
A relaxed version of this approach, which is easier to implement, was presented in [15], and experimental results show that
the compression ratio with this type of dictionary goes down to 0.32 for English text [11]. This performance is kept if the
greedy approach is applied statically during the second reading of the string, using the dictionary obtained from the first
reading. Moreover, if the compression is applied independently to different blocks of data of one kilobyte, or to smaller blocks after
the boundaries preprocessing, there is still just a one percent loss on the compression ratio. On the other hand, we pay
a very small price with the transmission to the decoder of a header containing the information on the dictionary learned
during the first reading of the input string, since the size is much smaller than one megabyte.
6. Conclusion
We considered the problem of speeding up file compression on a distributed system when the input type is known.
A straightforward implementation of greedy dictionary-based static text compression, suitable for a standard large scale
system, distributes data blocks among the processors to be compressed independently. This approach can, obviously, be
applied when the system is arbitrarily scaled down. In order to push scalability beyond what is traditionally considered a
large scale system, a more involved approach distributes overlapping blocks to compute boundary matches. These boundary
matches are relevant to maintain the compression effectiveness. If we have a standard small, medium or large scale sys-
tem available, the approach with boundary matches is too conservative. The absence of a communication cost during the
computation in both implementations guarantees a linear speed-up in some cases. Moreover, a finite-state machine imple-
mentation of sequential greedy dictionary-based static text compression was shown, which speeds up the execution of the
distributed algorithm in a relevant way when the data blocks are large, that is, when the size of the input file is large and
the size of the distributed system is relatively small. As future work, experiments on parallel running times should be done
to see the effects of the preprocessing phase on the speed-up. This can be figured out using a small parallel machine even if
the application is for systems beyond the standard size. Another experiment is to see how relevant the finite-state machine
implementation is when the size of the data blocks decreases.
References
[1] D. Belinskaya, S. DeAgostino, J.A. Storer, Near optimal compression with respect to a static dictionary on a practical massively parallel architecture, in:
Proceedings of the IEEE Data Compression Conference, 1995, pp. 172–181.
[2] T.C. Bell, J.G. Cleary, I.H. Witten, Text Compression, Prentice Hall, 1990.
[3] L. Cinque, S. DeAgostino, L. Lombardi, Scalability and communication in parallel low-complexity lossless compression, Math. Comput. Sci. 3 (2010)
391–406.
[4] M. Cohn, R. Khazan, Parsing with suffix and prefix dictionaries, in: Proceedings of the IEEE Data Compression Conference, 1996, pp. 180–189.
[5] M. Crochemore, L. Giambruno, A. Langiu, F. Mignosi, A. Restivo, Dictionary-symbolwise flexible parsing, J. Discrete Algorithms 14 (2012) 74–90.
[6] M. Crochemore, A. Langiu, F. Mignosi, Note on the greedy parsing optimality for dictionary-based text compression, Theor. Comput. Sci. 525 (2014)
55–59.
[7] M. Crochemore, W. Rytter, Jewels of Stringology, World Scientific, 2003.
[8] S. DeAgostino, Sub-linear algorithms and complexity issues for lossless data compression, Master’s thesis, Brandeis University, 1994.
[9] S. DeAgostino, Parallelism and data compression via textual substitution, PhD dissertation, Sapienza University of Rome, 1995.
[10] S. DeAgostino, Parallelism and dictionary-based data compression, Inf. Sci. 135 (2001) 43–56.
[11] S. DeAgostino, Bounded size dictionary compression: relaxing the LRU deletion heuristic, in: Proceedings of the Prague Stringology Conference, 2005,
pp. 135–142.
[12] S. DeAgostino, Parallel implementations of dictionary text compression without communication, in: London Stringology Days, 2009.
[13] S. DeAgostino, LZW data compression on large scale and extreme distributed system, in: Proceedings of the Prague Stringology Conference, 2012,
pp. 18–27.
[14] S. DeAgostino, The greedy approach to dictionary-based static text compression on a distributed system, in: Proceedings of the International Conference
on Advanced Engineering Computing and Applications in Sciences, 2014, pp. 1–6.
[15] S. DeAgostino, R. Silvestri, Bounded size dictionary compression: SC^k-completeness and NC algorithms, Inf. Comput. 180 (2003) 101–112.
[16] S. DeAgostino, J.A. Storer, Parallel algorithms for optimal compression using dictionaries with the prefix property, in: Proceedings of the IEEE Data
Compression Conference, 1992, pp. 52–61.
[17] A. Farrugia, P. Ferragina, A. Frangioni, R. Venturini, Bicriteria data compression, in: Proceedings of the SIAM–ACM Symposium on Discrete Algorithms,
2014, pp. 1582–1585.
[18] P. Ferragina, I. Nitto, R. Venturini, On optimally partitioning a text to improve its compression, Algorithmica 61 (2011) 51–74.
[19] P. Ferragina, I. Nitto, R. Venturini, On the bit-complexity of Lempel–Ziv compression, SIAM J. Comput. 42 (2013) 1521–1541.
[20] A. Hartman, M. Rodeh, Optimal parsing of strings, in: A. Apostolico, Z. Galil (Eds.), Combinatorial Algorithms on Words, Springer, 1985, pp. 155–167.
[21] D.S. Hirschberg, L.M. Stauffer, Parsing algorithms for dictionary compression on the PRAM, in: Proceedings of the IEEE Data Compression Conference,
1994, pp. 136–145.
[22] D.S. Hirschberg, L.M. Stauffer, Dictionary compression on the PRAM, Parallel Process. Lett. 7 (1997) 297–308.
[23] A. Langiu, On parsing optimality for dictionary-based text compression – the zip case, J. Discrete Algorithms 20 (2013) 65–70.
[24] A. Lempel, J. Ziv, On the complexity of finite sequences, IEEE Trans. Inf. Theory 22 (1976) 75–81.
[25] A. Lempel, J. Ziv, A universal algorithm for sequential data compression, IEEE Trans. Inf. Theory 23 (1977) 337–343.
[26] Y. Matias, C.S. Sahinalp, On the optimality of parsing in dynamic dictionary-based data compression, in: Proceedings of the SIAM–ACM Symposium on
Discrete Algorithms, 1999, pp. 943–944.
[27] V.S. Miller, M.N. Wegman, Variations on a theme by Ziv and Lempel, in: A. Apostolico, Z. Galil (Eds.), Combinatorial Algorithms on Words, Springer, 1985,
pp. 131–140.
[28] H. Nagumo, M. Lu, K. Watson, Parallel algorithms for the static dictionary compression, in: Proceedings of the IEEE Data Compression Conference, 1995,
pp. 162–171.
[29] E.J. Schuegraf, H.S. Heaps, A comparison of algorithms for data base compression by use of fragments as language elements, Inf. Storage Retr. 10 (1974)
309–319.
[30] L.M. Stauffer, D.S. Hirschberg, PRAM algorithms for static dictionary compression, in: Proceedings of the International Symposium on Parallel Processing,
1994, pp. 344–348.
[31] J.A. Storer, Data Compression: Methods and Theory, Academic Press, 1988.
[32] R.A. Wagner, Common phrases and minimum text storage, Commun. ACM 16 (1973) 148–152.
[33] T.A. Welch, A technique for high-performance data compression, Computer 17 (1984) 8–19.
[34] J. Ziv, A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Trans. Inf. Theory 24 (1978) 530–536.