Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data
Sheng Di and Franck Cappello
Abstract—Since today's scientific applications produce vast amounts of data, compressing these data before storage/transmission is critical. Results of existing compressors reveal two types of HPC data sets: highly compressible and hard to compress. In this work, we carefully design and optimize error-bounded lossy compression for hard-to-compress scientific data. We propose an optimized algorithm that can adaptively partition the HPC data into best-fit consecutive segments, each having mutually close data values, such that the compression condition can be optimized. Another significant contribution is the optimization of the shifting offset such that the XOR-leading-zero length between two consecutive unpredictable data points can be maximized. We finally devise an adaptive method to select the best-fit compressor at runtime for maximizing the compression factor. We evaluate our solution using 13 benchmarks based on real-world scientific problems, and we compare it with 9 other state-of-the-art compressors. Experiments show that our compressor can always guarantee the compression errors within the user-specified error bounds. Most importantly, our optimization can improve the compression factor effectively, by up to 49 percent for hard-to-compress data sets, with similar compression/decompression time cost.
Index Terms—Error-bounded lossy compression, floating-point data compression, high performance computing, scientific simulation
1 INTRODUCTION
In this work, we present a new error-bounded lossy compressor for hard-to-compress data sets with three significant contributions:

- We carefully analyze what kinds of data sets are hard to compress. Specifically, we find that hard-to-compress data lead to similar compression levels (or the same order of magnitude of compression factors) with different lossy compressors, and they are generally hard to approximate accurately by curve-fitting models. We adopt SZ [11], [12] to assess/detect the hardness of data compression because it exhibits outstanding compression factors in most cases based on our experiments. More details can be found in Section 3.

- We propose three key optimization strategies to improve the compression factor for hard-to-compress data significantly. (1) We propose an optimized algorithm that can adaptively partition snapshot data into a set of best-fit consecutive segments, each containing similar values. The data compression is performed based on the segments, such that the compression factor can be improved significantly because of the similar data features in each segment. (2) More crucially, we optimize the shifting offset for each segment such that the XOR-leading-zero lengths (see footnote 1) can be maximized during the compression. (3) We propose a light-weight adaptive compression method to further improve the compression factors, by selecting the best-fit compressors at runtime based on different variables.

- The proposed optimization strategies have been implemented rigorously as a new compressor, supporting C and Fortran. We evaluate it by running 13 benchmarks based on real-world scientific simulation problems across different scientific domains on a large-scale cluster [13]. We compare it with numerous state-of-the-art compression methods (including SZ [11], [12], Gzip [7], FPC [14], ISABELA [9], ZFP [15], Wavelet(SSEM) [8], and FPZIP [16]). Experiments show that our solution can improve the compression factors by up to 49 percent, especially for hard-to-compress data. The compression factors range from 2.82:1 through 537:1 on the 13 benchmarks, based on our new compression technique.

1. XOR-leading-zero length (or leading-zero length) [14] refers to the number of leading zeros in the IEEE 754 representation of the XOR operation of two consecutive floating-point data in the data set. That is, it is equal to the number of exactly the same bit values in the beginning part of the two consecutive floating-point data.

The rest of the paper is organized as follows. We formulate the HPC data compression problem in Section 2. In Section 3, we characterize the HPC data for which high compression factors are difficult to obtain. In Section 4, we take an overview of the previous design principle of SZ and discuss why some data sets are hard to compress; we find that SZ can serve as an indicator for hard-to-compress data. In Section 5, we describe three key optimization strategies that work very effectively on the compression of hard-to-compress data. We present the evaluation results in Section 6. In Section 7, we discuss related work; and in Section 8 we summarize our conclusions and briefly discuss future work.

2 PROBLEM FORMULATION

In this work, we focus mainly on how to compress hard-to-compress HPC data. None of the lossy compressors can achieve high compression factors on these simulation data (usually stored in the form of a floating-point array). As presented in our previous work [11], the compression factors under the SZ compressor may span a large range (e.g., from 1.6:1 to 436:1) with the same specified error bound for different data sets. In our characterization (to be shown later), many of the data sets exhibit the same order of magnitude in the compression factor, no matter what lossy compressors are adopted. Therefore, we denote as hard-to-compress data sets the data sets for which high compression factors are hard to reach with any data compressors.

We adopt the error-bounded lossy compression model. Specifically, the users are allowed to set an upper bound (denoted by Δ) for the compression errors. The compression errors (defined as the difference between the data points' original values and their corresponding decompressed values) of all the data points must be strictly limited within this bound; in other words, X'_i must be in [X_i − Δ, X_i + Δ], where X'_i and X_i refer to a decompressed value and the corresponding original value, respectively. We leave the determination of the error bound to users, because applications may have largely different features and diverse data sets such that users may have quite different requirements.

The key objective is to improve the error-bounded lossy compression factors as much as possible for hard-to-compress data sets. The compression factor or compression ratio (denoted by r) is defined as the ratio of the original total data size to the compressed data size. Suppose the original data size S_o is reduced to S_c after the compression. Then the compression factor is r = S_o / S_c. With the same error bound, a higher compression factor with a lower compression/decompression time implies better results.

3 CHARACTERIZATION OF LOSSY COMPRESSION LEVEL FOR HPC DATA

In this section, we characterize the lossy compression levels and define the hard-to-compress data for this work.
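Before characterizing the benchmarks, the error-bound requirement of Section 2 can be made concrete with a short sketch. The helper below is illustrative only (the name is ours, not part of the released compressor); it checks, after decompression, that every value X'_i stays within [X_i − Δ, X_i + Δ]:

```c
#include <math.h>
#include <stddef.h>

/* Illustrative check of the error-bound condition from Section 2:
 * every decompressed value must satisfy |X'_i - X_i| <= delta.
 * Returns 1 if the bound is respected for all points, 0 otherwise. */
static int respects_error_bound(const double *orig, const double *dec,
                                size_t n, double delta)
{
    for (size_t i = 0; i < n; i++) {
        if (fabs(dec[i] - orig[i]) > delta)
            return 0;   /* point i violates the user-specified bound */
    }
    return 1;
}
```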
TABLE 1
Benchmarks Used in This Work
The benchmarks used in our investigation belong to six different scientific domains: hydrodynamics (HD), magnetohydrodynamics (MHD), gravity study (GRAV), particles simulation (PAR), shock simulation (SH), and climate simulation (CLI). The data come from four HPC code packages or models: FLASH [17], Nek5000 [18], HACC [1], and the Community Earth System Model (CESM) [2], as shown in Table 1. For the benchmarks from the FLASH code and Nek5000 code, each was run through 1,000 time steps, generating 1,000 snapshots (except for Orbit, because its run met the termination condition at time step 464). For the benchmarks provided in the FLASH code package, every snapshot has 10+ variables, with a total of 82 K–655 K data points, which is comparable to the data size used by other related research such as ISABELA [9] and NUMARCK [19]. CICE was run with 500 time steps because that number is already enough for its simulation to converge. The ATM benchmark is 1.5 TB in size (it has 63 snapshots, each being about 24 GB in size); thus, it is a good case to use for evaluating the compressor's ability on extremely large data sizes. The cosmology simulation is based on a parallel particle simulation code, HACC [1]. All these benchmarks produce double-precision floating-point data except for ATM and the cosmology simulation, which adopt single precision in storing their data.

We conduct this characterization work using three typical lossy compressors: SZ [11], ZFP [15], and ISABELA [9]; other compressors exhibit similar or even worse results (as shown in our previous work [11]). SZ comprises three steps for the compression: the first step involves various curve-fitting models to approximate data values, the second step analyzes the IEEE 754 binary representation for unpredictable data, and the last step improves the compression ratio by the lossless compressor Gzip (a.k.a. the deflate algorithm). Gzip itself comprises an LZ77 step that leverages symbol and string repetition and a Huffman encoding step that performs variable-length encoding. ZFP combines several techniques such as fixed-point integer conversion, block transform, and binary representation analysis with bit-plane encoding. SZ and ZFP are both error-bounded lossy compressors, and the error bound is set to 10^-6 in our characterization. Unlike SZ and ZFP, ISABELA is unable to guarantee the absolute error bound, though it allows users to set a point-wise relative error bound. It converts the multidimensional data to a sorted data series and then performs B-spline interpolation. In addition, we include two improved versions (ZFP+Gzip and ISABELA+Gzip) for ZFP and ISABELA, using Gzip to further improve their compression factors.

We note two things based on our experiments. The first is that the data sets produced by different scientific simulations may lead to significantly different compression factors even under the same lossy compressor, because of the diverse features of simulation data. The second is that one specific data set generally leads to similar compression levels (or the same order of magnitude of compression factor) with different lossy compressors, especially in the cases where the data are hard to compress with high factors. The compression factor of the Blast2 benchmark, for example, can always go up to several dozens with any lossy compressor combined with Gzip. In contrast, the compression factor of Sedov is always below 10:1, whatever lossy compressors are used. This observation motivates us to classify the data based on the level of compression factors.

Based on the above analysis, we define hard-to-compress data to be data sets whose compression factors are always relatively low in the lossy compression. Specifically, a data set will be considered hard to compress if its compression factor r stays below 10:1 under all of the existing error-bounded compressors. Our experiments indicate that a remarkable portion of the benchmarks (6 out of 14) are hard to compress, such as Sedov, BlastBS, Eddy, CICE, and ATM.

4 ANALYSIS OF LOSSY COMPRESSOR SZ

We start the overall analysis with our prior work, SZ, because it exhibits an outstanding compression quality respecting error bounds in our experiments (see Table 2 for details). In this section, we first present an overview of the design of SZ. We then provide an in-depth analysis of this lossy compressor, focusing on what kinds of data are hard to compress and the root causes; such information is fundamental for the optimization of SZ lossy compression.

TABLE 2
Compression Ratios of Various Lossy Compressors on Different Data Sets (Error Bound = 10^-6): Note that ISABELA Does Not Respect the Absolute Error Bound, as Confirmed in Our Previous Work [11]

Benchmark        SZ     ZFP    ZFP+Gzip  ISABELA  ISABELA+Gzip
Blast2           110.2  6.8    36.2      4.56     46.2
Sedov            7.44   5.99   7.06      4.42     7.44
BlastBS          3.26   3.65   3.78      4.43     5.06
Eddy             8.13   8.96   9.53      4.34     5.18
Vortex           13.6   10.9   12.2      4.43     4.72
BrioWu           71.2   8.24   49.1      5        57.4
GALLEX           183.6  36.7   92.7      4.89     33.6
MacLaurin        116    21.77  31.4      4.1      5.47
Orbit            433    85     157       4.96     8.43
ShafranovShock   48     4.43   29.5      4.24     12.2
CICE             5.43   5.23   5.54      4.19     4.46
ATM              4.02   3.17   3.49      3.1      3.7
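Taking the definitions of Sections 2 and 3 together, the compression factor and the hard-to-compress classification used above can be summarized in a few lines. This is an illustrative helper under our own naming, not code from the compressor:

```c
#include <stddef.h>

/* Compression factor r = So/Sc (original size over compressed size). */
static double compression_factor(size_t original_bytes, size_t compressed_bytes)
{
    return (double)original_bytes / (double)compressed_bytes;
}

/* factors[] holds the compression factors reached by the available
 * error-bounded compressors on one data set (e.g., one row of Table 2).
 * The data set is treated as hard to compress when none of them
 * reaches 10:1. */
static int is_hard_to_compress(const double *factors, size_t k)
{
    for (size_t i = 0; i < k; i++) {
        if (factors[i] >= 10.0)
            return 0;   /* at least one compressor reaches 10:1 */
    }
    return 1;           /* all factors stay below 10:1 */
}
```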
TABLE 3
Compression Factor of SZ(w/S) versus SZ(w/oS)
4.1.3 Optimizing Lossy Compression for Unpredictable Data by Binary Representation Analysis

In this step, SZ compresses the unpredictable data one by one, by analyzing their IEEE 754 binary representation. Because a closer-to-zero floating-point number requires fewer mantissa bits to be saved in order to obtain a specific precision, SZ first converts all the data to another set of data by a linear data normalization, such that all the converted data are expected to be close to zero. Specifically, all unpredictable data are normalized by subtracting a fixed number. The fixed number is set to the middle value, which is equal to (min + max)/2, where min and max refer to the minimum value and maximum value in the whole data set, respectively. After that, SZ shrinks the storage size of the normalized data by removing the insignificant bytes in the mantissa and using the XOR-leading-zero-based floating-point compression method.

4.1.4 Further Shrinking the Compressed Size by Gzip

SZ further reduces the storage size by running the lossless compressor Gzip on the compressed byte stream produced by the above three steps. Note that since one snapshot often has many variables (or data arrays), we actually adopt the Gzip step only once for all variables together, after each of them has been processed by the previous SZ steps. This is because we observe that performing Gzip in batch for all variables in a snapshot can sometimes improve the compression factor more prominently than performing Gzip on each variable separately, probably because of the similar patterns or repeated values across variables.

4.2 Analysis of Hard-to-Compress Data for SZ

In this section, we provide an in-depth analysis of why SZ can obtain high compression factors in some cases but suffers low compression factors in other cases. Although our experiments (as shown in Table 3) show that the compression ratio of SZ is significantly higher than that of other state-of-the-art compressors in many cases, SZ may not work effectively on some hard-to-compress data sets, such as Blast2 and Eddy.

In fact, improving lossy compression for hard-to-compress data is much more difficult than the original design of a lossy compressor such as SZ. On the one hand, we need to thoroughly understand the hard-to-compress data before optimizing the compression factors for those data. On the other hand, we have to limit the compression/decompression cost to a low level, such that the compression/decompression time will not be increased significantly. In this sense, many of the advanced but time-consuming techniques, such as data sorting (adopted by [9]) and K-means clustering (used by [19]), cannot be used in our solution.

As described previously, the storage byte stream generated by SZ compression has two major parts: a bit array to denote the best-fit curve-fitting models for the predictable data and a stream of bytes representing the unpredictable data. The latter part can be further split into two subparts: an XOR-leading-zero part and a significant-bytes part that excludes the XOR-leading-zero bytes. That is, the compressed size (or the compression factor) is dominated by these three parts. Accordingly, we characterize such information under SZ based on the 13 benchmarks.

Fig. 3 presents the unpredictable ratio (i.e., the ratio of the amount of unpredictable data to the total amount of data) during the SZ lossy compression on the first 11 benchmarks listed in Table 1. The results for the last two benchmarks in Table 1 will be shown later because they have too few snapshots to present clearly with the other benchmarks in one figure. Combining Fig. 3 and Table 3 can reveal the relationship between the unpredictable ratio and the compression factor. First of all, a high unpredictable ratio may lead to a low compression factor in most cases. For example, GALLEX's compression factor is up to 183.6, while its unpredictable ratio is only 3.5 percent; and Sedov's compression factor is 7.44, while its unpredictable ratio is in [90 percent, 98 percent] on many time steps. Other benchmarks showing similar behavior include MacLaurin, Orbit, ShafranovShock, BlastBS, and Vortex. The key reason is that a high unpredictable ratio means that most of the data cannot be approximated by the best-fit curve-fitting model in the lossy compression. We also note that the unpredictable ratio based on our curve-fitting model may not always dominate the compression factor. The Blast2 benchmark, for instance, is a typical example that exhibits a very high compression factor (about 110) while its unpredictable ratio is 80+ percent under our curve-fitting prediction model. Such a high compression factor is due to the effective reduction of storage size in the lossy compression of unpredictable data (to be shown later). Specifically, when the XOR-leading-zero lengths of most unpredictable data are equal to or a little longer than a multiple of 8, the unpredictable data compression will work very well in that most of them require only 2-bit XOR-leading-zero codes to represent their values.
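To make the preceding discussion concrete, the following sketch shows one way to compute the XOR-leading-zero length between two consecutive double-precision values (the helper is ours and is not taken from the SZ code base):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative computation of the XOR-leading-zero length (footnote 1):
 * the number of identical leading bits shared by two consecutive
 * IEEE 754 double-precision values. */
static int xor_leading_zero_length(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof(ua));   /* reinterpret the bit patterns */
    memcpy(&ub, &b, sizeof(ub));
    uint64_t x = ua ^ ub;          /* differing bits become 1 */
    int count = 0;
    for (int bit = 63; bit >= 0 && !((x >> bit) & 1ULL); bit--)
        count++;                   /* count leading zero bits of the XOR */
    return count;                  /* 64 means the two values are identical */
}
```

As noted above, when these lengths cluster at (or slightly above) a multiple of 8, most unpredictable values can be represented with a 2-bit XOR-leading-zero code, which is exactly what the segmentation and offset-shifting optimizations of Section 5 try to arrange.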
Fig. 6. Illustration of data partitioning.

to keep the edge indices for the segments; thus, the data partitioning will introduce extra storage bytes. Hence, how to optimize the partitioning and maximize compression factors becomes a challenging issue. In what follows, we propose a fast algorithm that can split the data set into the best-fit consecutive segments effectively.

The basic idea is to make the data in each segment tend to have the same exponent, such that their XOR-leading-zero lengths are close to each other. We partition the floating-point space into multiple intervals whose sizes increase exponentially, since the data values can span a large value range across different exponents. Specifically, the floating-point space is partitioned into the following intervals or groups (called exponent-partitioned intervals): ..., [-4, -2], [-2, -1], [-1, -0.5], [-0.5, -0.25], ..., 0, ..., [0.25, 0.5], [0.5, 1], [1, 2], [2, 4], .... We observe that each interval corresponds to a unique exponent number based on the IEEE 754 representation of the floating-point number. The exponent parts of the floating-point numbers in [0.5, 1], for example, are all -1, represented in binary as 01111111110 for double precision and 01111110 for single precision, respectively. Hence, we can simply extract the exponent part of each data value to check the interval it belongs to, which is a rapid computation in that this operation does not involve whole floating-point number parsing but only short-type integer parsing. The key idea of our solution is checking the exponent values of the data to compress and analyzing their changes in the sequence, in order to partition them into different segments. Specifically, as long as the exponent of the current data value changes across the edge of the exponent-partitioned interval compared with that of the last data point, we need to verify whether the amount of data collected is large enough for constructing a separate segment compared with the storage overhead (i.e., the extra storage size introduced by recording the segment information for the data). If the sign of a data value is changed compared with its preceding data points and if the length of the current segment is long enough, a candidate segment will also be generated; otherwise, the change of signs will be ignored.

To illustrate the basic idea of our data-partitioning method, we give an example with n data points to compress. As presented in Fig. 6, the data values span vertically different exponent-partitioning intervals throughout the data set. Once some data point's value (such as the data points i, j, k, p shown in the figure) goes across the edge of an exponent-partitioning interval compared with its preceding data value, the data index is recorded, because the collected data set (such as [0, i], [i, j], [j, k], [k, p]) may construct a separate segment. Note that the size of the candidate interval [k, p] is too small to obtain the gains of the storage-size saving against the storage overhead; hence, it should not be separated but should be merged with its adjacent segments [j, k] and [p, n]. As a result, the above n data points would be partitioned into three segments [0, i], [i, j], and [j, n], each of which would be handled separately by the compressor later. The proposed method can be performed rapidly because of the fast processing on the exponent-partitioned interval checking for each data point and the low theoretical time complexity O(N) (discussed in more detail later). Moreover, this method can be considered a best-fit solution because it is able to partition the data set precisely based on a best-fit segment-merging function (to be discussed later) over the exponent-partitioned intervals.

We present the pseudo code in Algorithm 1. All the segments to be generated are organized in a doubly-linked list with an empty segment as a header.

Algorithm 1. Fast Best-Fit Data Partitioning
Input: a sequence of data (denoted by X_0, X_1, ..., X_n), the minimum segment storage overhead threshold (denoted by h; see footnote 2), and the user-specified error bound (denoted by Δ).
Output: best-fit partitioning (denoted by S = {ES_1, ES_2, ...}), where ES refers to a segment partitioned based on exponent-partitioned intervals.
1:  reqExpo ← getExponent(Δ).
2:  preExpo ← getExponent(X_0).
3:  for (i = 1, 2, ..., n-1) do
4:    curExpo ← getExponent(X_i).
5:    if (curExpo < reqExpo) then
6:      curExpo ← reqExpo.
7:    end if
8:    if (curExpo < preExpo) then
9:      preES ← createCandSeg(curExpo, i).
10:   else if (curExpo > preExpo) then
11:     Call backTrackParsing(preES, curExpo, h), and denote the latest settled segment by mergedES.
12:     if (preES.fixed & preES.level < curExpo) then
13:       preES ← createCandSeg(curExpo, i).
14:     else
15:       preES ← mergedES.
16:       preES.length++.
17:     end if
18:   else
19:     preES.length++.
20:   end if
21:   preExpo ← curExpo.
22: end for
23: Call backTrackParsing(preES, curExpo, h).
24: Clean up the whole segment set S, by merging each segment whose value range size is smaller than Δ.

At the beginning (line 1) of the algorithm, the required exponent value (denoted by reqExpo) is computed based on the user-specified error bound (denoted by Δ), in order to determine the significant bits in the representation of the floating-point numbers. Specifically, reqExpo is equal to getExponent(Δ), where getExponent() is a function that extracts the exponent value from a floating-point number.

2. The minimum segment storage overhead threshold is to avoid generating segments in the data partitioning that are too small. Specifically, since we need to keep the segment's starting index (32 bits) and the segment length (32 bits) to maintain each segment, the threshold is set to 64 (in bits) in our design.
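The getExponent() step in Algorithm 1 does not require parsing the whole floating-point number. The sketch below (our own illustration, assuming IEEE 754 doubles) reads the 11-bit exponent field directly, which is enough to identify the exponent-partitioned interval a value falls into:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative getExponent() for IEEE 754 doubles: extract the biased
 * 11-bit exponent field and remove the bias (1023). Values in [0.5, 1),
 * for instance, all map to -1, i.e., one exponent-partitioned interval.
 * (Sketch only: zero and subnormals both map to -1023 here.) */
static int get_exponent(double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof(bits));
    int biased = (int)((bits >> 52) & 0x7FF);  /* 11-bit exponent field */
    return biased - 1023;                      /* unbiased exponent */
}
```

This is consistent with the "short-type integer parsing" mentioned above: the sign and exponent of a double fit entirely within its top 16 bits.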
Fig. 7. Illustration of segment merging.

Next, the algorithm compares the exponent of each data value (denoted by curExpo) and that of its preceding data value (denoted by preExpo) throughout the whole sequence of the data. If the exponent of some data value is smaller than the reqExpo value (i.e., the data value itself is smaller than the user-required error bound), its exponent will be flushed to the value of reqExpo (lines 5–7), because reqExpo is the user-accepted exponent and thus can lead to more data being predictable by the curve-fitting. Then, the algorithm compares the values of curExpo and preExpo to determine whether the current data point index can be treated as a segment edge. Specifically, if curExpo is smaller than preExpo (such as at the data index i in Fig. 6), the current data index will be recorded by the algorithm, creating a candidate segment (denoted by the function createCandSeg() in the pseudo code). By contrast, if curExpo exhibits a greater value than preExpo (probably because of a sharp increase in the data value), then the algorithm will check whether the previously created candidate segments are long enough to be treated as separate segments or should be merged with other segments (lines 11–17 in the pseudo code). The details about this part are included in an iterative function, backTrackParsing(), to be described later. The preES refers to the preceding exponent-partitioned segment with respect to the current data point. The rest of the code (lines 8–21) updates the preES for checking the next data point; preES.fixed denotes whether the preES is already determined as a separate segment or not, and preES.level refers to the corresponding exponent value of the segment preES. The last step (line 24) of the algorithm checks each segment and removes any segment whose value range size is smaller than the error bound, because the data in such a segment are all supposed to be predictable.

The backTrackParsing function aims to remove too-short candidate segments. In the example presented in Fig. 6, backTrackParsing() will be called at data points j, p, and n, respectively. Some of the candidate segments (such as [k, p] shown in Fig. 6) will be merged with their preceding segments because their sizes are too small compared with the segment storage overhead. The pseudo code of backTrackParsing is presented in Algorithm 2. It tries merging the current segment curES with its preceding segment by calling merge(preES, curES, nextLevel, h) iteratively.

The core of the backTrackParsing algorithm is the merge function, which is illustrated in Fig. 7. There are eight possible cases with regard to the different exponent levels of the preceding segment (l_p), current segment (l_c), and next segment (l_n). All eight cases can be split into three groups. In the first group, the segments exhibit a bump, with the highest level on the middle/current segment. In this situation, our merge function will simply return the current segment. In the second group, the best-fit merging method is right-merging, because otherwise an extra unnecessary higher level would be introduced, which might degrade the compression factor for the current segment in turn. Let us take case (d) as an example. Suppose that the current segment were merged with the left segment. Then the current level l_c would become l_p instead of l_n, leading to a larger value range for the lossy compression of the data in the current segment. This would, in turn, raise a larger deviation with respect to the current segment, introducing coarser compression granularity unexpectedly. Similarly, the best-fit segment-merging method for all the cases in Group 3 is left-merging, which leads to the minimum exponent deviation for the data in the current segment.

Algorithm 2. BackTrackParsing Algorithm
Input: the last candidate segment previously created (denoted by curES), the exponent of the current data value (denoted by curExpo), and the segment-storage-overhead threshold (denoted by h).
Output: the previously marked candidate segments are checked as to whether they should be merged or not.
backTrackParsing(curES, curExpo, h)
1:  if (curES is fixed or curES is header) then
2:    return NULL.
3:  end if
4:  preES ← the preceding segment of curES.
5:  mergedES ← merge(preES, curES, nextLevel, h).
6:  nextLevel ← curES.level.
7:  preES ← the preceding segment of the mergedES.
8:  latestES ← backTrackParsing(preES, nextLevel, h).
9:  if (latestES is NULL) then
10:   return mergedES's segment.
11: else
12:   return latestES's segment.
13: end if

The time complexity of our best-fit data-partitioning algorithm is O(N): the algorithm needs to go over all data points just once. In the iterative backTrackParsing algorithm, each of the previously collected candidate segments will also be checked only once. Also note that most of the operations work on short-type integers (i.e., exponent levels), which means fairly fast processing in practice.

Fig. 8 shows that our partitioning algorithm can effectively split the data set into consecutive segments. The two data sets from Vortex and BlastBS are partitioned into 63 segments and 39 segments, respectively, such that the data are all close to each other in every segment.
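The three groups described above reduce to a comparison of the three exponent levels. The fragment below is our own simplified restatement of that decision, not the exact eight-case analysis of Fig. 7, and it ignores ties and other corner cases:

```c
/* Simplified sketch of the merge-direction decision suggested by the
 * three groups above. lp, lc, and ln are the exponent levels of the
 * preceding, current, and next segments, respectively. */
typedef enum { KEEP_CURRENT, MERGE_RIGHT, MERGE_LEFT } merge_decision_t;

static merge_decision_t choose_merge(int lp, int lc, int ln)
{
    if (lc > lp && lc > ln)
        return KEEP_CURRENT;   /* Group 1: bump on the current segment */
    if (ln <= lp)
        return MERGE_RIGHT;    /* Group 2: avoid adopting the higher level lp */
    return MERGE_LEFT;         /* Group 3: lp is the lower level */
}
```

The intent is the same in every branch: never let a short segment adopt a higher exponent level than necessary, since a larger value range coarsens the lossy compression granularity for its data.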
data may already lead to proper XOR-leading-zero lengths. That is, the extra segmentation and optimization of shifting offsets may not improve compression factors but may even degrade them because of the inevitable overhead. We note that this is a unique case where SZ(w/oS) happens to be close enough to optimal.

To improve the compression factor in all situations, we devised an adaptive compression method (namely, SZ(Ada)) by combining SZ(w/S) and SZ(w/oS). We selected the best-fit solutions for different variables adaptively. Such a design is motivated by our observation that various compressors lead to very close compression factors on the same variables for short-distance snapshots. Fig. 10 presents the compression factors on 24 variables in three snapshots (time steps 15, 30, and 45) of the benchmark ATM. We observe that the compression factor does not differ significantly across snapshots for the same variables. Based on this analysis, our adaptive method SZ(Ada) performs either SZ(w/S) or SZ(w/oS) on the compression of each variable in every snapshot, and the best-fit compressor is checked periodically (every 20 snapshots in our implementation) and recorded in a bit-mask array: each bit represents either SZ(w/S) or SZ(w/oS) for a variable. Since the two solutions have similar compression/decompression times (to be shown later), the total compression/decompression time of SZ(Ada) increases little because of the periodic best-fit compressor checking (e.g., only a 1/20 increment if the checking period is 20 snapshots). Such an adaptive design can significantly improve the compression factors, by up to 40 percent in hard-to-compress cases, while still guaranteeing the user-specified error bounds (shown in the next section).

6 EVALUATION OF COMPRESSION QUALITY

We first describe the experimental setup used in the evaluation and then present the evaluation results by comparing our solution with nine other state-of-the-art compressors.

6.1 Experimental Setup

In our experiments, we compared our approach with other state-of-the-art compressors, including lossless compressors such as Gzip and FPC and lossy compressors such as SZ, ZFP (0.5.0), ISABELA, and Sasaki et al.'s approach (here referred to as SSEM, based on the authors' last names). A brief description of these compressors can be found in Section 7. There are two versions of the existing SZ compressor, and we call them SZ(w/oS) [11] and SZ(MD + Q) [12], respectively. The detailed experiment setting of the parameters used by SZ(w/oS) is consistent with that of the corresponding paper [11]. SZ(MD + Q) [12] is a rather new version based on the SZ model, which improves the prediction accuracy at the data prediction step such that the unpredictable ratio can be reduced as much as possible. Specifically, it adopts multi-dimensional prediction instead of one-dimensional prediction, and it also adopts an error-controlled quantization method to encode the prediction values. As for SZ(MD + Q), we set the number of quantization bins to 128 for all the benchmarks except for ATM, on which we set it to 65,536, in order to reach a high compression factor considering the overhead of storing the Huffman tree. If there are multiple variables in a snapshot, we perform data prediction and encoding on each variable and then perform Gzip compression for all variables together in this snapshot.

We evaluate the compression quality based on the 13 benchmarks listed in Table 1. The experiment setting for the 13 benchmarks can be found in Section 3. In our experiments, we adopt two important data-distortion metrics, maximum compression error and peak signal-to-noise ratio (PSNR), to evaluate the peak compression error and the overall compression error, respectively. PSNR is defined as follows:

PSNR = 20 log10(value_range) - 10 log10(MSE),    (2)

where value_range and MSE refer to the data value range and the mean squared compression error, respectively.

6.2 Experimental Results

6.2.1 Compression Factor

Table 4 presents the compression factors of 10 state-of-the-art compressors based on a total of 13 benchmarks (note that ISA, ISA + Gzip, and SSEM are not error-bounded compressors). As highlighted in the table, SZ(Ada) leads to the highest compression factors in most cases (8 out of 13 benchmarks). Its compression factor is even higher than those of the non-error-bounded compressors such as ISABELA and SSEM. In absolute terms, SZ(Ada) improves the compression factors by up to 107 percent over our previous work SZ(w/oS), and by up to 49 percent for hard-to-compress cases.
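For reference, the two distortion metrics defined in Section 6.1 (maximum compression error and PSNR, Eq. (2)) can be computed as in the following sketch; the helper is ours and is not part of the released compressor:

```c
#include <math.h>
#include <stddef.h>

/* Illustrative computation of the two distortion metrics used in the
 * evaluation: maximum compression error, and PSNR as in Eq. (2),
 * PSNR = 20*log10(value_range) - 10*log10(MSE). */
static void distortion_metrics(const double *orig, const double *dec, size_t n,
                               double *max_err, double *psnr)
{
    double vmin = orig[0], vmax = orig[0], mse = 0.0, maxe = 0.0;
    for (size_t i = 0; i < n; i++) {
        double e = dec[i] - orig[i];
        if (fabs(e) > maxe) maxe = fabs(e);
        mse += e * e;
        if (orig[i] < vmin) vmin = orig[i];
        if (orig[i] > vmax) vmax = orig[i];
    }
    mse /= (double)n;
    *max_err = maxe;
    *psnr = 20.0 * log10(vmax - vmin) - 10.0 * log10(mse);
}
```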
TABLE 4
Compression Factors (the Percentages in Parentheses Refer to the Improvement Compared with Our Previous Work SZ(w/oS) [11])

Benchmark        SZ(Ada)       SZ(w/oS)  SZ(MD+Q)  ZFP    ZFP+Gzip  ISA   ISA+Gzip  SSEM(a)  FPZIP-40(b)  Gzip  FPC(c)
Blast2           138 (25%)     110       211.8     6.8    36.2      4.56  46.2      39.7     22.9         77    11.4
Sedov            8.75 (17.6%)  7.44      7.84      5.99   7.06      4.42  7.44      17(d)    3.43         3.13  1.9
BlastBS          4.12 (26.4%)  3.26      4.0       3.65   3.78      4.43  5.06      8.45     2.43         1.24  1.29
Eddy             12.14 (49%)   8.13      11.87     8.96   9.53      4.34  5.18      N/A      2.56         5.5   3.89
Vortex           28.1 (107%)   13.6      21.29     10.9   12.2      4.43  4.72      12       3.35         2.23  2.34
BrioWu           104 (46%)     71.2      104.85    8.24   49.1      5     57.4      35.7     21.9         73    8.5
GALLEX           237 (29%)     183.6     255.1     36.7   92.7      4.89  33.6      82.4     20.35        34.7  11.37
MacLaurin        136 (17.2%)   116       110       21.77  31.4      4.1   5.47      7.44     3.84         2.03  2.08
Orbit            537 (24%)     433       433.2     85     157       4.96  8.43      11.7     3.9          1.8   1.86
ShafranovShock   54 (12.5%)    48        47        4.43   29.5      4.24  12.2      20.3     19.9         28    7.33
CICE             6.87 (26.5%)  5.43      6.71      5.23   5.54      4.19  4.46      3.83     2.3          2.6   2.67
ATM              4.27 (8.1%)   3.95      4.97      3.17   3.49      3.1   3.7       1.82     1.04         1.36  N/A

(a) SSEM cannot work on Eddy because it requires each dimension to be of even size, whereas the Eddy data are 128 x 32 x 5 x 5 for each variable.
(b) FPZIP-40 means that 40 bits out of 64 bits are extracted and stored for each floating-point data point. (For ATM, FPZIP adopts FPZIP-30 instead, because FPZIP does not support a precision of 40 bits for these data.)
(c) FPC cannot work on ATM because it does not support single-precision floating-point compression.
(d) Note that SSEM does not respect the error bound, as confirmed in [11].
Fig. 11. Maximum compression errors of error-bounded lossy compressors (error bound is set to 10^-6).
The key reason SZ(Ada) can obtain such a significant improvement is that it can adaptively select SZ(w/S), in terms of variables at runtime, for the hard-to-compress data and choose SZ(w/oS) when SZ(w/S)'s segmentation overhead is relatively large compared with the compressed size. We also note that SZ(MD + Q) works better than SZ(Ada) on the ATM data set. The reason is that SZ(MD + Q) adopts a multi-dimensional prediction method, which may significantly reduce the number of unpredictable data. How to integrate the advantages of SZ(MD + Q) and SZ(Ada) will be included in our future work.

6.2.2 Compression Error

In Fig. 11, we present the maximum compression errors calculated after decompressing all the data for the three error-bounded compressors: SZ(Ada), SZ(w/oS), and ZFP. We clearly observe that the three lossy compressors are all able to restrict the compression errors within the error bound (10^-6, as set in our experiments). Note that SZ(w/oS) and ZFP both over-preserve the precision to varying degrees, compared with the specified error bound. Specifically, the compression errors with SZ(w/oS) are within [2 × 10^-7, 10^-6] for a vast majority of the data, and ZFP's compression errors are within [1 × 10^-7, 4 × 10^-7] for most of the data. In comparison, SZ(Ada)'s compression errors are about 9 × 10^-7 for the majority of the data, which can explain why SZ(Ada) works better than the other two to a certain extent.
TABLE 5
Compression Performance (in MB/s)

Benchmark  ISA.   ZFP    ZFP+G  SZ(w/S)  SZ(MD+Q)  SZ(A)  Gzip
Blast2     6.1    98.4   52.8   90.5     140.5     88.4   107
Sedov      5.74   71     45.7   49.3     52.4      47.8   20.2
BlastBS    13.44  57.9   35.4   48       50        43     11.7
Eddy       5.73   47.1   25.7   45.8     60.7      45.6   53.7
Vortex     5.37   67.4   38.4   78.4     80.6      73.4   31.5
BrioWu     8.33   114.6  69.2   82.1     97.3      79.1   50.9
GALLEX     8.7    270    193    119      138.5     108    51.4
MacLau.    7.1    225    175    192.1    211.4     182.6  24.32
Orbit      8      217.1  168.9  200      226.9     183.1  14.9
Shaf.Sh.   6.42   50.2   23.7   43.9     47.3      41.7   24
CICE       4.7    52.7   29.4   49.6     52.6      47.4   54.9
ATM        N/A    58.6   24.4   38.4     41.4      36.6   22.9

TABLE 6
Decompression Performance (in MB/s)

Benchmark  ISA.   ZFP    ZFP+G  SZ(w/S)  SZ(MD+Q)  SZ(A)  Gzip
Blast2     22.5   105    88.4   124.9    140.5     117.5  145.7
Sedov      22.5   71     64.1   103.1    117.8     101.5  65
BlastBS    22.4   170    129.5  234.3    259       224    81
Eddy       23     46.9   41.8   44.6     48.8      43.4   44.3
Vortex     23.3   60.4   54.4   59.8     74.4      59.2   55.2
BrioWu     24     102.8  97.3   97.3     116.4     94.8   73.3
GALLEX     24.1   245.5  225    245.5    270       245.5  36
MacLau.    22.7   120    111.5  112.5    121.2     110.5  64.3
Orbit      23     217.1  203    205      211.1     200    20.3
Shaf.Sh.   19.4   43.9   39.7   46.7     49.4      43.9   31.5
CICE       22.8   65.8   38     68.3     75.2      65.4   67.3
ATM        N/A    54.7   47.6   192.6    156.6     150.2  216.45
6.2.3 Rate Distortion

In Fig. 12, we present the rate-distortion results of five different compression techniques, including SZ(adaptive), SZ(with segments), SZ(MD + Q), ZFP [15], and ZFP + Gzip, for three typical benchmarks, due to the space limit of the paper. These three benchmarks are representative of different research domains (Sedov is a shock simulation, CICE is a climate simulation, and HACC is a cosmology simulation). As for rate-distortion, rate (also known as bit-rate) refers to the number of bits used to represent a data point on average during the compression (the smaller the better). Distortion is assessed using the peak signal-to-noise ratio, which is a common criterion to assess the overall compression error (the higher the better). Based on the three figures, we note that SZ(Ada) leads to the best results with respect to the first two cases, and its bit-rate is less than that of the second-best compressor SZ(w/o segment) by 10 percent and 50 percent on CICE and Sedov, respectively. The reason is three-fold: (1) we adopt an adaptive solution that selects the best-fit options for different variables dynamically; (2) we optimize the unpredictable data compression using the segmented offset-shifting method, which can improve the compression factor for hard-to-compress cases in particular; (3) the Gzip step is performed on all variables (10 variables in Sedov and 5 variables in CICE) after each variable's data is separately processed with the previous SZ compression steps (including predictable data compression and unpredictable data compression). Since we have only one snapshot of the HACC data set, we cannot evaluate SZ(Ada) in this case.

Fig. 12c shows that SZ(w/ segment) has a rate-distortion result similar to that of ZFP+Gzip, and it is lower than the original ZFP compressor by 2-3 bits per data point. The reason ZFP may not work very well on the HACC data set is that the HACC data are composed of multiple 1D arrays, each representing the particles' partial information (such as the coordinate value in one dimension), such that the adjacent data values in each 1D array have no clear coherence, whereas ZFP depends heavily on such coherence. We also observe that SZ(MD + Q) exhibits the best rate-distortion result on the HACC data, because it can significantly reduce the unpredictable ratio. One reason SZ(MD + Q) may not work very effectively on Sedov and CICE is that each snapshot in the FLASH and Nek5000 benchmarks is relatively small, such that the constant Huffman-tree-storing overhead in SZ(MD + Q) is prominent.

6.2.4 Performance of Compression and Decompression

We present in Table 5 the compression performance (MB/s) based on all snapshots for each benchmark. We note that ISABELA suffers from the highest compression cost because of its slow data-sorting step. The other four compressors exhibit a similar level of compression rate. Specifically, ZFP exhibits the best compression performance in general. Note that SZ includes a lossless compression step (Gzip), which may take a major portion of the total execution time. Based on a breakdown of execution times, the Gzip step takes about 30-50 percent of the compression time for SZ in most cases. ZFP + Gzip leads to a much lower compression rate, as shown in Table 5, because of its larger compressed size after its original compression.

The decompression performance is presented in Table 6. Similar to the compression performance, ISABELA suffers the lowest decompression rate (i.e., higher decompression time). We also observe that for all the benchmarks, SZ(w/S)'s decompression performance is close to that of ZFP in most cases. The key reason that SZ(w/S) works fast on decompression is that it just needs to decode the best-fit curve-fitting type and rebuild the unpredictable data by bitwise operations. SZ(Ada)'s decompression performance is close to that of SZ(w/S) because it adopts either SZ(w/oS) or SZ(w/S) for each variable adaptively.

TABLE 7
Parallel Processing Time (in Seconds): cmpres Refers to Compression, wr_cmpres_data Refers to Writing Compressed Data, rd_cmpres_data Refers to Reading Compressed Data, and decmpres Refers to Decompression

#cores  cmpres  wr_cmpres_data  sum    rd_cmpres_data  decmpres  sum
1       126.8   0.8             127.6  35.3            1.2       36.5
2       65      0.62            65.62  18.16           0.6       18.76
4       34      0.63            34.63  9.52            0.9       10.42
8       18.38   0.8             19.18  5.23            0.69      5.92
16      9.38    0.82            10.2   2.62            0.35      2.97
32      4.79    0.7             5.49   1.36            0.26      1.62
64      2.46    0.8             3.26   0.73            0.15      0.88
128     1.27    0.8             2.07   0.36            0.1       0.46
256     0.86    0.79            1.65   0.21            0.07      0.28
512     0.44    0.8             1.24   0.1             0.04      0.14
1024    0.3     0.5             0.8    0.074           0.02      0.094

We compare the performance of processing the cosmology simulation data [1] with our compressor against the I/O performance without the compressor, as shown in Table 7. We emulate the course of the in-situ compression at runtime, by splitting the cosmology data into multiple pieces
and performing the compression in parallel by different ranks under an MPI program before storing the data into the PFS. The simulation scale ranges from 1 core through 1,024 cores on the Argonne Blues cluster [13]. The writing time and reading time of the original data set (3.5 GB) through the parallel file system are, respectively, 4.9 seconds and 4.1 seconds on average based on our experiments. Based on Table 7, we can see that the compression time and decompression time both decrease linearly with the number of cores. When the running scale is increased to 64 cores, the total overhead of writing data (i.e., compression time + writing time = 3.26 seconds) already gets much lower than the time of writing the original data set (4.9 seconds). When the parallel scale of the simulation is up to 1,024 cores, the overhead of writing data is down to only 1/5 of the time of writing the original data set. The data reading overhead will be less than 1/40 (0.094 seconds versus 4.1 seconds) of the time of reading the original data set, which is a significant improvement for the simulation performance at runtime. The key reason for the high performance gain with respect to the reduction of data writing/reading overhead is two-fold: on the one hand, the compression/decompression time decreases linearly with the increasing number of cores because there is no communication cost among different ranks; on the other hand, the compressed size is much smaller than the original data size, leading to a much lower I/O time cost. With a value_range-based relative error bound of 1E-4, the compression factor under our compressor is 2.73, compared with 1.48 under ZFP 0.5.0 and 1.2 under Gzip.

7 RELATED WORK

HPC data compressors can be split into two categories: lossless compressors [7], [14], [16] and lossy compressors [8], [9], [11], [12], [15], [19]. Lossless compressors can be further split into general data compressors and floating-point data compressors. The former can compress any type of data stream, including video streams. A typical example is Gzip [7], which integrates the LZ77 [29] algorithm and Huffman encoding [30]. The LZ77 algorithm makes use of a sliding window to search for repeated sequences in the data and replaces them with references to a single copy existing earlier in the data stream. Huffman encoding [30] is an entropy-based lossless compression scheme that assigns each symbol in the data stream a unique prefix-free code. Floating-point data compressors compress a set of floating-point numbers by analyzing the IEEE 754 binary representations of the data one by one. Typical examples include FPC [14] and Fpzip [16], which leverage finite context models and predictive coding of floating-point data, respectively. The common issue of such lossless compression methods is the relatively low compression ratio, which will significantly limit the performance of the runtime data processing or post-processing, especially for exascale scientific simulation.

In recent years, many lossy compressors have been proposed to significantly reduce the data reading/writing cost for large-scale HPC applications. Existing state-of-the-art compressors often combine multiple strategies, such as vector quantization (VQ), orthogonal transform, curve-fitting approximation (CFA), analysis of the floating-point binary representation (BA), and Gzip lossless compression (Gzip). NUMARCK, for example, approximates the differences between snapshots by vector quantization. ISABELA converts the multidimensional data to a sorted data series and then performs B-spline interpolation. ZFP involves more complicated techniques such as fixed-point integer conversion, block transform, and binary representation analysis with bit-plane encoding. Fpzip adopts predictive coding and also ignores insignificant bit planes in the mantissa based on the analysis of the IEEE 754 binary representation. SSEM splits data into a high-frequency part and a low-frequency part by wavelet transform and then uses vector quantization and Gzip. SZ is an error-bounded lossy compressor proposed in [11]; it comprises four compression steps, as described in Section 4.1. In addition, we recently improved the prediction accuracy by adopting a multi-dimensional prediction and error-controlled quantization model [12]. We compare the compression techniques proposed in this paper to that approach as well, and we observe that the new solution outperforms it in most cases, especially on the rate-distortion metric.

We presented in the preceding section the evaluation results by comparing our solution with all of the available compressors (see footnote 3), using 13 applications across different scientific domains. Our new solution leads to significantly higher compression factors with comparable compression/decompression times, and it also guarantees the user-specified error bound.

3. We did not include NUMARCK because of three factors: (1) its code is unavailable for download; (2) it does not respect the error bound, as shown in [11]; and (3) its compression ratio is not competitive with the others [11].

In addition to the mesh-data-based compressors, there are some other lossy compressors tailored for particular scientific simulations: particle data compression related to molecular dynamics research or cosmology simulation is one typical example. In this type of simulation, a very large number of particles are simulated/analyzed, and the key information includes both the position and velocity of each particle in three dimensions. How to compress particle data very effectively has been studied for years [31], [32], [33], [34]. However, almost all of the related compressors are designed based on the trajectory analysis of the individual particles along time steps, which requires the users to load/keep multiple snapshots during the compression. This is impractical when the number of particles is extremely large, because of the limited memory capacity. By contrast, our compressor allows compressing the snapshots separately, which is very critical to the large-scale particle simulations that require in-situ compression at runtime.

8 CONCLUSION AND FUTURE WORK

In this paper, we present a novel error-bounded HPC floating-point data compressor. We propose an optimized algorithm that can adaptively partition the data into a set of best-fit consecutive segments and also optimize the shifting offset for the data transformation such that the XOR-leading-zero lengths can be maximized. Our compressor supports C and Fortran, and it can be downloaded under a BSD license. Key findings are threefold:
- Its compression factor range is [2.82, 538], which is higher than those of many related lossy compressors in most cases, based on our experiments with 10+ benchmarks across multiple research domains.
- The compression errors are always strictly limited within the user-specified error bound.
- Its compression/decompression performance is comparable to those of the other techniques.

In future work, we plan to further explore new ideas to improve the compression factors, e.g., by combining SZ(MD + Q) and the techniques proposed in this paper. We also plan to study the relationship between the compression factor and the error bound, and to support common HPC data formats such as netCDF and HDF5.

ACKNOWLEDGMENTS

This research was supported by the Exascale Computing Project (ECP), Project Number 17-SC-20-SC, a collaborative effort of two DOE organizations - the Office of Science and the National Nuclear Security Administration - responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation's exascale computing imperative. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (Argonne). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357.

REFERENCES

[1] S. Habib, V. Morozov, N. Frontiere, H. Finkel, A. Pope, and K. Heitmann, "HACC: Extreme scaling and performance across diverse architectures," in Proc. Int. Conf. High Performance Comput. Netw. Storage Anal., 2013, pp. 1-10.
[2] Community Earth System Model (CESM). [Online]. Available: https://www2.cesm.ucar.edu/
[3] A. H. Baker, H. Xu, J. M. Dennis, M. N. Levy, D. Nychka, and S. A. Mickelson, "A methodology for evaluating the impact of data compression on climate simulation data," in Proc. ACM 23rd Int. Symp. High-Performance Parallel Distrib. Comput., 2014, pp. 203-214.
[4] K. Paul, S. Mickelson, J. M. Dennis, H. Xu, and D. Brown, "Lightweight parallel python tools for earth system modeling workflows," in Proc. IEEE Int. Conf. Big Data, 2015, pp. 1985-1994.
[5] Earth System Grid (ESG). [Online]. Available: https://www.earthsystemgrid.org/home.htm
[6] A. H. Baker, et al., "Evaluating lossy data compression on climate simulation data within a large ensemble," J. Geoscientific Model Develop. Discussions, vol. 2016, pp. 1-38, 2016.
[7] Gzip compression. [Online]. Available: http://www.gzip.org
[8] N. Sasaki, K. Sato, T. Endo, and S. Matsuoka, "Exploration of lossy compression for application-level checkpoint/restart," in Proc. IEEE 29th Int. Parallel Distrib. Process. Symp., 2015, pp. 914-922.
[9] S. Lakshminarasimhan, et al., "Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data," in Proc. 17th Euro-Par, 2011, pp. 366-379.
[10] Argonne MIRA system. [Online]. Available: http://www.alcf.anl.gov/mira
[11] S. Di and F. Cappello, "Fast error-bounded lossy HPC data compression with SZ," in Proc. IEEE 30th Int. Parallel Distrib. Process. Symp., 2016, pp. 730-739.
[12] D. Tao, S. Di, Z. Chen, and F. Cappello, "Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization," in Proc. 31st Int. Parallel Distrib. Process. Symp., 2017, pp. 1129-1139.
[13] Blues Cluster. [Online]. Available: http://www.lcrc.anl.gov/
[14] M. Burtscher and P. Ratanaworabhan, "High throughput compression of double-precision floating-point data," in Proc. Data Compression Conf., 2007, pp. 293-302.
[15] P. Lindstrom, "Fixed-rate compressed floating-point arrays," IEEE Trans. Vis. Comput. Graph., vol. 20, no. 12, pp. 2674-2683, Dec. 2014.
[16] P. Lindstrom and M. Isenburg, "Fast and efficient compression of floating-point data," IEEE Trans. Vis. Comput. Graph., vol. 12, no. 5, pp. 1245-1250, Sep.-Oct. 2006.
[17] ASCF Center, "FLASH User's Guide (Version 4.2)." (2014). [Online]. Available: http://flash.uchicago.edu/site/flashcode/user_support/flash4_ug_4p3.pdf
[18] P. Fischer, "Nek5000 user guide." (2010). [Online]. Available: http://www.mcs.anl.gov/~fischer/nek5000/examples.pdf
[19] Z. Chen, S. W. Son, W. Hendrix, A. Agrawal, W. Liao, and A. Choudhary, "NUMARCK: Machine learning algorithm for resiliency and checkpointing," in Proc. IEEE/ACM Supercomputing Int. Conf. High Performance Comput. Netw. Storage Anal., 2014, pp. 733-744.
[20] P. Colella and P. R. Woodward, "The piecewise parabolic method (PPM) for gas-dynamical simulations," J. Comput. Physics, vol. 54, pp. 174-201, 1984.
[21] L. I. Sedov, Similarity and Dimensional Methods in Mechanics, 10th ed. New York, NY, USA: Academic Press, 1959.
[22] A. L. Zachary, A. Malagoli, and P. Colella, "A higher-order Godunov method for multidimensional ideal magnetohydrodynamics," SIAM J. Scientific Comput., vol. 15, no. 2, pp. 263-284, 1994.
[23] O. Walsh, "Eddy solutions of the Navier-Stokes equations," in Proc. Navier-Stokes Equations II - Theory and Numerical Methods, 1991, pp. 306-309.
[24] M. Brio and C. C. Wu, "An upwind differencing scheme for the equations of ideal magnetohydrodynamics," J. Comput. Physics, vol. 75, pp. 400-422, 1988.
[25] A. Obabko, "Simulation of gallium experiment." (2005). [Online]. Available: http://www.cmso.info/cmsopdf/princeton5oct05/talks/Obabko-05.ppt
[26] V. D. Shafranov, "The structure of shock waves in a plasma," Sov. Phys. JETP, vol. 5, 1957, Art. no. 1183.
[27] D. Bailey, et al., "Community Ice CodE (CICE) user's guide (version 4.0)." [Online]. Available: http://www.cesm.ucar.edu/models/ccsm4.0/cice/ice_usrdoc.pdf
[28] J. Rice, Mathematical Statistics and Data Analysis, 2nd ed. Pacific Grove, CA, USA: Duxbury Press, 1995.
[29] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[30] D. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, no. 9, pp. 1098-1101, Sep. 1952.
[31] H. Ohtani, K. Hagita, A. M. Ito, T. Kato, T. Saitoh, and T. Takeda, "Irreversible data compression concepts with polynomial fitting in time-order of particle trajectory for visualization of huge particle system," J. Physics: Conf. Series, vol. 45, no. 1, pp. 1-11.
[32] D. Y. Yang, A. Grama, and V. Sarin, "Bounded-error compression of particle data from hierarchical approximate methods," in Proc. IEEE/ACM Supercomput. Int. Conf. High Performance Comput. Netw. Storage Anal., 1999, Art. no. 32.
[33] K. Hagita, T. Takeda, T. Kato, H. Ohtani, and S. Ishiguro, "Efficient data compression of time series of particles' positions for high-throughput animated visualization," in Proc. IEEE/ACM Supercomput. Int. Conf. High Performance Comput. Netw. Storage Anal., 2013, pp. 1-2.
[34] A. Kumar, X. Zhu, Y. Tu, and S. Pandit, "Compression in molecular simulation datasets," in Proc. Int. Conf. Intell. Sci. Big Data Eng., 2013, pp. 22-29.
Sheng Di received the master's degree from the Huazhong University of Science and Technology in 2007 and the PhD degree from the University of Hong Kong in 2011. He is currently an assistant computer scientist with Argonne National Laboratory. His research interest involves resilience on high-performance computing (such as silent data corruption, optimization of checkpoint models, and in-situ data compression) and broad research topics on cloud computing (including optimization of resource allocation, cloud network topology, and prediction of cloud workload/hostload). He is working on multiple HPC projects, such as detection of silent data corruption, characterization of failures and faults for HPC systems, and optimization of multilevel checkpoint models. Contact him at [email protected]. He is a member of the IEEE.

Franck Cappello is a program manager and senior computer scientist at ANL. Before moving to ANL, he held a joint position at Inria and the University of Illinois at Urbana-Champaign, where he initiated and co-directed from 2009 the Inria-Illinois-ANL Joint Laboratory on Petascale Computing. Until 2008, he led a team at Inria, where he initiated the XtremWeb (Desktop Grid) and MPICH-V (fault-tolerant MPI) projects. From 2003 to 2008, he initiated and directed the Grid5000 project, a nationwide computer science platform for research in large-scale distributed systems. He has authored papers in the domains of fault tolerance, high-performance computing, and Grids, and he has contributed to more than 70 program committees. He is an editorial board member of the IEEE Transactions on Parallel and Distributed Systems, the International Journal on Grid Computing, the Journal of Grid and Utility Computing, and the Journal of Cluster Computing. He is/was program co-chair of the IEEE CCGRID 2017, Award chair of the ACM/IEEE SC15, Program co-chair of the ACM HPDC 2014, Test of time award chair of the IEEE/ACM SC13, Tutorial co-chair of the IEEE/ACM SC12, Technical papers co-chair of the IEEE/ACM SC11, Program chair of HiPC 2011, Program co-chair of the IEEE CCGRID 2009, Program Area co-chair of the IEEE/ACM SC09, and General chair of the IEEE HPDC 2006. He is a fellow of the IEEE and a member of the ACM. Contact him at [email protected].