
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 29, NO. 1, JANUARY 2018

Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data

Sheng Di, Member, IEEE, and Franck Cappello, Fellow, IEEE
The authors are with the Mathematics and Computer Science (MCS) Division at Argonne National Laboratory, Lemont, IL 60439. Digital Object Identifier no. 10.1109/TPDS.2017.2749300

Abstract—Since today’s scientific applications are producing vast amounts of data, compressing them before storage/transmission is
critical. Results of existing compressors show two types of HPC data sets: highly compressible and hard to compress. In this work, we
carefully design and optimize the error-bounded lossy compression for hard-to-compress scientific data. We propose an optimized
algorithm that can adaptively partition the HPC data into best-fit consecutive segments each having mutually close data values, such
that the compression condition can be optimized. Another significant contribution is the optimization of shifting offset such that the
XOR-leading-zero length between two consecutive unpredictable data points can be maximized. We finally devise an adaptive method
to select the best-fit compressor at runtime for maximizing the compression factor. We evaluate our solution using 13 benchmarks
based on real-world scientific problems, and we compare it with 9 other state-of-the-art compressors. Experiments show that our
compressor can always guarantee the compression errors within the user-specified error bounds. Most importantly, our optimization
can improve the compression factor effectively, by up to 49 percent for hard-to-compress data sets with similar compression/decompression time cost.

Index Terms—Error-bounded lossy compression, floating-point data compression, high performance computing, scientific simulation

1 INTRODUCTION

TODAY'S scientific simulations are producing petabytes of data, with the result that I/O cost has become a huge bottleneck for on-line, in situ data processing as well as postexecution data analysis. Hardware/Hybrid Accelerated Cosmology Code (HACC) [1], for example, can generate 20 petabytes of data for a single 1-trillion-particle simulation; yet a system such as Mira at ANL has only 26 petabytes of file system storage, and a single user cannot request 75 percent of the total storage capacity for a simulation. HACC users address this limitation by data decimation, storing an order of magnitude less data than produced, which limits their study to a coarse grain or constrains the visualization to the local area. Another typical application involved with a vast volume of data is climate simulation—the Community Earth System Model [2], [3]. As indicated by the work of Paul et al. [4], nearly 2.5 PB of data were produced by CESM for CMIP5, which further introduced 170 TB of postprocessing data submitted to the Earth System Grid (ESG) [5]. Estimates of the raw data requirements for the CMIP6 project exceed 10 PB [6]. Because of the limited compression factor (or compression ratio) of lossless compressors (such as Gzip [7]) on floating-point data sets, lossy compression has been studied for years, especially for the exascale execution that is expected to produce vast amounts of data [3], [8].

The key challenge in designing a generic, efficient error-bounded lossy compressor with high compression factors for high-performance computing (HPC) applications is the large diversity of scientific simulation data. Lossy compressors often assume that such data follow regular characteristics, in order to represent vast amounts of data by specific methods such as wavelet transform, vector quantization, and spline interpolation. However, real-world scientific simulation data often exhibit irregular characteristics, including various dimensions, different scales, and dynamic changes in both space and time. Some data values may span a large value range, such that vector quantization with a fixed number of bins may suffer from huge compression errors. The simulation data may also exhibit spiky changes in local areas, such that the spline interpolation method may result in too many knot points to keep [9]. Such irregular data characteristics may easily cause the data to be hard to compress by any of the existing lossy compressors (to be shown later).

We note that hard-to-compress data may impede the HPC execution/analysis performance more easily than easy-to-compress data do. We give an example to illustrate this point. Suppose there are two data sets whose original storage sizes are both 800 TB, and they are produced on the Argonne MIRA system [10] with the same computation time of 28 minutes. Their compressed sizes after lossy compression are 400 TB and 8 TB, respectively. What if their compressed sizes can both be further reduced by 50 percent with a more effective compressor? Then, the data writing time can be reduced by 200 TB / (240 GB/s) ≈ 14 minutes and 4 TB / (240 GB/s) ≈ 17 seconds, respectively, considering that the I/O bandwidth of MIRA for users is 240 GB/s. The reduction in writing time corresponds to 50 percent of the processing time in the former case, whereas it corresponds to only 1 percent in the latter case.
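
As a quick back-of-the-envelope check of the figures above (our own worked arithmetic, assuming 1 TB = 1,000 GB), the saved write time is simply the saved data volume divided by the aggregate I/O bandwidth:

\[
\frac{200\,\mathrm{TB}}{240\,\mathrm{GB/s}} = \frac{200{,}000\,\mathrm{GB}}{240\,\mathrm{GB/s}} \approx 833\,\mathrm{s} \approx 14\,\mathrm{min},
\qquad
\frac{4\,\mathrm{TB}}{240\,\mathrm{GB/s}} = \frac{4{,}000\,\mathrm{GB}}{240\,\mathrm{GB/s}} \approx 16.7\,\mathrm{s} \approx 17\,\mathrm{s}.
\]
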

In this work, we present a new error-bounded lossy compressor for hard-to-compress data sets with three significant contributions:

- We carefully analyze what kinds of data sets are hard to compress. Specifically, we find that hard-to-compress data lead to similar compression levels (or the same order of magnitude of compression factors) with different lossy compressors, and they are generally hard to approximate accurately by curve-fitting models. We adopt SZ [11], [12] to assess/detect the hardness of data compression because it exhibits outstanding compression factors in most cases based on our experiments. More details can be found in Section 3.

- We propose three key optimization strategies to improve the compression factor for hard-to-compress data significantly. (1) We propose an optimized algorithm that can adaptively partition snapshot data into a set of best-fit consecutive segments each containing similar values. The data compression is performed based on the segments, such that the compression factor can be improved significantly because of the similar data features in each segment. (2) More crucially, we optimize the shifting offset for each segment such that the XOR-leading-zero lengths¹ can be maximized during the compression. (3) We propose a light-weight adaptive compression method to further improve the compression factors, by selecting the best-fit compressors at runtime based on different variables.

- The optimization strategies proposed have been implemented rigorously as a new compressor, supporting C and Fortran. We evaluate it by running 13 benchmarks based on real-world scientific simulation problems across different scientific domains on a large-scale cluster [13]. We compare it with numerous state-of-the-art compression methods (including SZ [11], [12], Gzip [7], FPC [14], ISABELA [9], ZFP [15], Wavelet (SSEM) [8], and FPZIP [16]). Experiments show that our solution can improve the compression factors by up to 49 percent, especially for hard-to-compress data. The compression factors range from 2.82:1 to 537:1 on the 13 benchmarks, based on our new compression technique.

1. XOR-leading-zero length (or leading-zero length) [14] refers to the number of leading zeros in the IEEE 754 representation of the XOR of two consecutive floating-point data in the data set. That is, it is equal to the number of exactly identical bit values in the beginning part of the two consecutive floating-point data.

The rest of the paper is organized as follows. We formulate the HPC data compression problem in Section 2. In Section 3, we characterize the HPC data for which high compression factors are difficult to obtain. In Section 4, we take an overview of the previous design principle of SZ, and discuss why some data sets are hard to compress; we find that SZ can serve as an indicator for hard-to-compress data. In Section 5, we describe three key optimization strategies that work very effectively on the compression of hard-to-compress data. We present the evaluation results in Section 6. In Section 7, we discuss related work; and in Section 8 we summarize our conclusions and briefly discuss future work.

2 PROBLEM FORMULATION

In this work, we focus mainly on how to compress hard-to-compress HPC data. None of the lossy compressors can achieve high compression factors on these simulation data (usually stored in the form of a floating-point array). As presented in our previous work [11], the compression factors under the SZ compressor may span a large range (e.g., from 1.6:1 to 436:1) with the same specified error bound for different data sets. In our characterization (to be shown later), many of the data sets exhibit the same order of magnitude in the compression factor, no matter what lossy compressors are adopted. Therefore, we use the term hard-to-compress data sets for the data sets for which high compression factors are hard to reach with any data compressors.

We adopt the error-bounded lossy compression model. Specifically, the users are allowed to set an upper bound (denoted by Δ) for the compression errors. The compression errors (defined as the difference between the data points' original values and their corresponding decompressed values) of all the data points must be strictly limited within such a bound; in other words, X'_i must be in [X_i − Δ, X_i + Δ], where X'_i and X_i refer to a decompressed value and the corresponding original value, respectively. We leave the determination of the error bound to users, because applications may have largely different features and diverse data sets such that users may have quite different requirements.

The key objective is to improve the error-bounded lossy compression factors as much as possible for hard-to-compress data sets. The compression factor or compression ratio (denoted by r) is defined as the ratio of the original total data size to the compressed data size. Suppose the original data size S_o is reduced to S_c after the compression. Then the compression factor is r = S_o / S_c. With the same error bound, a higher compression factor with a lower compression/decompression time implies better results.
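
To make the two definitions above concrete, the following small C sketch (our own illustration, not code from the released compressor; the function names are hypothetical) checks whether a decompressed array respects a given absolute error bound Δ and computes the compression factor r:

    #include <math.h>
    #include <stddef.h>

    /* Returns 1 if every decompressed value X'_i lies in [X_i - delta, X_i + delta]. */
    static int error_bound_respected(const double *orig, const double *decomp,
                                     size_t n, double delta)
    {
        for (size_t i = 0; i < n; i++)
            if (fabs(decomp[i] - orig[i]) > delta)
                return 0;
        return 1;
    }

    /* Compression factor r = S_o / S_c (original size over compressed size). */
    static double compression_factor(size_t original_bytes, size_t compressed_bytes)
    {
        return (double)original_bytes / (double)compressed_bytes;
    }
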
3 CHARACTERIZATION OF LOSSY COMPRESSION LEVEL FOR HPC DATA

In this section, we characterize the lossy compression levels and define the hard-to-compress data for this work.

The benchmarks used in our investigation belong to six different scientific domains: hydrodynamics (HD), magnetohydrodynamics (MHD), gravity study (GRAV), particle simulation (PAR), shock simulation (SH), and climate simulation (CLI). The data come from four HPC code packages or models: FLASH [17], Nek5000 [18], HACC [1], and the Community Earth System Model (CESM) [2], as shown in Table 1. For the benchmarks from the FLASH code and the Nek5000 code, each was run through 1,000 time steps, generating 1,000 snapshots (except for Orbit, because its run met the termination condition at time step 464). For the benchmarks provided in the FLASH code package, every snapshot has 10+ variables, with a total of 82 k–655 k data points, which is comparable to the data size used by other related research such as ISABELA [9] and NUMARCK [19]. CICE was run with 500 time steps because that number is already enough for its simulation to converge.

TABLE 1
Benchmarks Used in This Work

Domain | Name | Code | Description | Size
HD | Blast2 [20] | Flash | Strong shocks and narrow features | 787 MB
HD | Sedov [21] | Flash | Hydrodynamical test code involving strong shocks and nonplanar symmetry | 660 MB
HD | BlastBS [22] | Flash | 3D version of the MHD spherical blast wave problem | 984 MB
HD | Eddy [23] | Nek5k | 2D solution to Navier-Stokes equations with an additional translational velocity | 820 MB
HD | Vortex [18] | Nek5k | Inviscid vortex propagation: earlier studies of finite volume methods | 580 MB
MHD | BrioWu [24] | Flash | Coplanar magnetohydrodynamic counterpart of the hydrodynamic Sod problem | 1.1 GB
MHD | GALLEX [25] | Nek5k | Simulation of the gallium experiment (a radiochemical neutrino detection experiment) | 270 MB
GRAV | MacLaurin [17] | Flash | MacLaurin spheroid (gravitational potential at the surface/inside a spheroid) | 6.3 GB
PAR | Orbit [17] | Flash | Testing the mapping of particle positions to gridded density fields, and the mapping of gridded potentials onto particle positions | 152 MB
PAR | CosmoSim [1] | HACC | Cosmology simulation with 147 million particles | 3.5 GB
SH | ShafranovShock [26] | Flash | A problem that provides a good verification for the structure of 1D shock waves in a two-temperature plasma with separate ion and electron temperatures | 246 MB
CLI | CICE [27] | CESM | Community sea-ice simulation based on the Community Earth System Model | 3.7 GB
CLI | ATM [2] | CESM | CAM-SE cubed sphere atmosphere simulation with very large data size produced | 1.5 TB

The ATM benchmark is 1.5 TB in size (it has 63 snapshots, each being about 24 GB in size); thus, it is a good case to use for evaluating the compressor's ability on extremely large data sizes. The cosmology simulation is based on a parallel particle simulation code, HACC [1]. All these benchmarks produce double-precision floating-point data except for ATM and the cosmology simulation, which adopt single precision in storing their data.

We conduct this characterization work using three typical lossy compressors: SZ [11], ZFP [15], and ISABELA [9]; other compressors exhibit similar or even worse results (as shown in our previous work [11]). SZ comprises three steps for the compression: the first step involves various curve-fitting models to approximate data values, the second step analyzes the IEEE 754 binary representation for unpredictable data, and the last step improves the compression ratio by the lossless compressor Gzip (a.k.a. the deflate algorithm). Gzip itself comprises a step of LZ77 that leverages symbols and string repetition and a step of Huffman encoding that performs variable-length encoding. ZFP combines several techniques such as fixed-point integer conversion, block transform, and binary representation analysis with bit-plane encoding. SZ and ZFP are both error-bounded lossy compressors, and the error bound is set to 10^-6 in our characterization. Unlike SZ and ZFP, ISABELA is unable to guarantee the absolute error bound, though it allows users to set a point-wise relative error bound. It converts the multidimensional data to a sorted data series and then performs B-spline interpolation. In addition, we include two improved versions (ZFP+Gzip and ISABELA+Gzip) for ZFP and ISABELA, using Gzip to further improve their compression factors.

We note two things based on our experiments. The first is that the data sets produced by different scientific simulations may lead to significantly different compression factors even under the same lossy compressor, because of the diverse features of simulation data. The second is that one specific data set generally leads to similar compression levels (or the same order of magnitude of compression factor) with different lossy compressors, especially in the cases where the data are hard to compress with high factors. The compression factor of the Blast2 benchmark, for example, can always go up to several dozens with any lossy compressor combined with Gzip. In contrast, the compression factor of Sedov is always below 10:1, whatever lossy compressors are used. This observation motivates us to classify the data based on the level of compression factors.

Based on the above analysis, we define hard-to-compress data to be data sets whose compression factors are always relatively low in the lossy compression. Specifically, a data set will be considered hard to compress if any of the existing error-bounded compressors will lead its compression factor r to be less than 10:1. Our experiments indicate that a remarkable portion of the benchmarks (6 out of 14) are hard to compress, such as Sedov, BlastBS, Eddy, CICE and ATM.

4 ANALYSIS OF LOSSY COMPRESSOR SZ

We start the overall analysis with our prior work, SZ, because it exhibits an outstanding compression quality respecting error bounds in our experiments (see Table 2 for details). In this section, we first present an overview of the design of SZ. We then provide an in-depth analysis of this lossy compressor, focusing on what kinds of data are hard to compress and the root causes; such information is fundamental for the optimization of SZ lossy compression.

TABLE 2
Compression Ratios of Various Lossy Compressors on Different Data Sets (Error Bound = 10^-6). Note that ISABELA does not respect the absolute error bound, as confirmed in our previous work [11].

Benchmark | SZ | ZFP | ZFP+Gzip | ISABELA | ISABELA+Gzip
Blast2 | 110.2 | 6.8 | 36.2 | 4.56 | 46.2
Sedov | 7.44 | 5.99 | 7.06 | 4.42 | 7.44
BlastBS | 3.26 | 3.65 | 3.78 | 4.43 | 5.06
Eddy | 8.13 | 8.96 | 9.53 | 4.34 | 5.18
Vortex | 13.6 | 10.9 | 12.2 | 4.43 | 4.72
BrioWu | 71.2 | 8.24 | 49.1 | 5 | 57.4
GALLEX | 183.6 | 36.7 | 92.7 | 4.89 | 33.6
MacLaurin | 116 | 21.77 | 31.4 | 4.1 | 5.47
Orbit | 433 | 85 | 157 | 4.96 | 8.43
ShafranovShock | 48 | 4.43 | 29.5 | 4.24 | 12.2
CICE | 5.43 | 5.23 | 5.54 | 4.19 | 4.46
ATM | 4.02 | 3.17 | 3.49 | 3.1 | 3.7

Fig. 1. Illustration of SZ lossy compression.


4.1 Overview of SZ

As presented in Fig. 1, given a set of floating-point data (denoted as original data), they will be split into two groups, predictable data and unpredictable data, based on a best-fit curve-fitting model (to be detailed later). The predictable data are denoted by the 2-bit code of the corresponding curve-fitting method, while the unpredictable data are further compressed by analyzing the XOR-leading-zero bytes between adjacent unpredictable data and the significance of bytes in the mantissa based on user-specified error bounds.

Specifically, the lossy compression of SZ involves the following four steps.

4.1.1 Linearization of Multidimensional Array

The first step of SZ is called linearization. SZ uses the intrinsic memory sequence of the data array to serve as the transformed 1-D data sequence for compression. The key advantage of such a design is two-fold: (1) extremely low cost of the transform (because we just need to cast the multi-dimensional array to a 1-D array) and (2) good locality preservation (except for the edges of the multi-dimensional arrays).

4.1.2 Compressing the Linearized Data by Best-Fit Curve-Fitting Models

In what follows, we discuss how the SZ compressor deals with the 1-D array {V_1, V_2, ..., V_N}. The basic idea is to check each data point in the 1-D array one by one, to see whether it can be predicted (within user-required error bounds) based on a few of its preceding values by some curve-fitting model (such as a linear curve or quadratic curve). If yes, the corresponding curve-fitting model is recorded for that point in a bit-array. The data that cannot be predicted are called unpredictable data, and they are to be compressed by analyzing the IEEE 754 binary representation.

For the data prediction, three curve-fitting models are adopted: preceding neighbor fitting, linear-curve fitting, and quadratic-curve fitting, which are described as follows:

- Preceding Neighbor Fitting (PNF): This is the simplest prediction model, which just uses the preceding value to fit the current value. Suppose the current value is V_i; then its predicted value (denoted by X_i^(N)) will be estimated as X_i^(N) = X_{i-1}. Note that the preceding data used in the decompression are not original values, so the PNF prediction here is supposed to be X_{i-1} instead of V_{i-1}. More details will be discussed later.

- Linear-Curve Fitting (LCF): This fitting model assumes that the current value V_i can be estimated by the line constructed using its previous two consecutive values. Specifically, the predicted value X_i^(L) is derived as X_i^(L) = X_{i-1} + (X_{i-1} − X_{i-2}) = 2X_{i-1} − X_{i-2}.

- Quadratic-Curve Fitting (QCF): The quadratic-curve fitting model assumes that the current value V_i can be predicted precisely by a quadratic curve that is constructed from the previous three consecutive values. Specifically, a quadratic curve (denoted by f(x) = ax^2 + bx + c) can be constructed from the three points (0, X_{i-3}), (1, X_{i-2}), and (2, X_{i-1}), respectively. Then, the predicted value at i can be computed by f(3) = 9a + 3b + c, where a, b, and c are determined by the three preceding points (0, X_{i-3}), (1, X_{i-2}), and (2, X_{i-1}). Hence, the predicted value X_i^(Q) can be derived as X_i^(Q) = f(3) = 3X_{i-1} − 3X_{i-2} + X_{i-3}.

Fig. 2. Illustration of fitting models.

Fig. 2 presents an example to further illustrate the above three fitting models. In the figure, three predicted values for the current data value V_i are denoted by the black cross, blue cross, and red cross, respectively. They are all predicted by the previous consecutive decompressed value(s), which are either predicted values generated in the compression or the unpredictable values stored separately. Note that it is critical that one should not directly use the original preceding data values {V_{i-3}, V_{i-2}, V_{i-1}} to perform the prediction for the data value X_i, since the preceding data that are to be used in the decompression are not the original preceding data values but the decompressed values with certain errors. Such a design guarantees the decompressed value X_i to meet user-required error bounds. A pseudo-code was provided in our previous conference paper [11].
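
The following C sketch (a simplified illustration written for this text, not the authors' released implementation) shows how such a per-point predictability check can be organized; note that predictions are computed from already-decompressed values, and that in the real compressor an unpredictable value is itself stored in an error-bounded, lossy form rather than verbatim:

    #include <math.h>
    #include <stddef.h>

    /* V: original values; X: decompressed values, filled in as compression goes.
     * Returns the 2-bit code of the best-fit model (1 = PNF, 2 = LCF, 3 = QCF),
     * or 0 if the value is unpredictable within the error bound delta. */
    static int fit_value(const double *V, double *X, size_t i, double delta)
    {
        double cand[3];
        int ncand = 0;
        if (i >= 1) cand[ncand++] = X[i-1];                            /* PNF */
        if (i >= 2) cand[ncand++] = 2.0*X[i-1] - X[i-2];               /* LCF */
        if (i >= 3) cand[ncand++] = 3.0*X[i-1] - 3.0*X[i-2] + X[i-3];  /* QCF */

        int best = -1;
        double bestErr = 0.0;
        for (int m = 0; m < ncand; m++) {
            double err = fabs(cand[m] - V[i]);
            if (best < 0 || err < bestErr) { best = m; bestErr = err; }
        }
        if (best >= 0 && bestErr <= delta) {
            X[i] = cand[best];   /* decompressed value = predicted value */
            return best + 1;
        }
        /* Unpredictable: handed to the binary-representation stage; the sketch
         * simply keeps the original value as a placeholder. */
        X[i] = V[i];
        return 0;
    }
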
This compression method incurs only a low memory overhead, because at most three preceding consecutive values (X_{i-3}, X_{i-2}, X_{i-1}) are required for checking the predictability of the value V_i, such that it needs to keep only three extra preceding decompressed values at runtime instead of all of the decompressed values. Suppose there are N data points to compress; the total memory overhead is only (2N + 64M) / (64N) = 1/32 + M/N of the original memory size, where M refers to the amount of unpredictable data.

The time complexity of the algorithm is O(N), where N here refers to the amount of floating-point data. Moreover, the major part of the algorithm involves only bit-wise operations, so the processing speed is supposed to be very fast.

The decompression is just a reverse procedure of the above compression algorithm. Specifically, it first parses the bit-array to retrieve the predictability and best-fit model information. If the current value is predictable, it will be reconstructed by the corresponding curve-fitting model; otherwise, it can be found in a separate data array and will be recovered by the binary-representation analysis.

4.1.3 Optimizing Lossy Compression for Unpredictable Data by Binary Representation Analysis

In this step, SZ compresses the unpredictable data one by one, by analyzing their IEEE 754 binary representation. Because a closer-to-zero floating-point number requires fewer mantissa bits to be saved in order to obtain a specific precision, SZ first converts all the data to another set of data by a linear data normalization, such that all the converted data are expected to be close to zero. Specifically, all unpredictable data are normalized by subtracting a fixed number. The fixed number is set to the middle value, which is equal to ½(min + max), where min and max refer to the minimum value and maximum value in the whole data set, respectively. After that, SZ shrinks the storage size of the normalized data by removing the insignificant bytes in the mantissa and using the XOR-leading-zero-based floating-point compression method.
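
A minimal sketch of this middle-value normalization is given below (our own illustration of the step just described; in SZ the min and max are taken over the whole data set and only the unpredictable values are shifted):

    #include <stddef.h>

    /* Shift every value by -(min+max)/2 so that the transformed values cluster
     * around zero and need fewer significant mantissa bytes for a given bound. */
    static void middle_value_normalize(double *data, size_t n)
    {
        if (n == 0) return;
        double min = data[0], max = data[0];
        for (size_t i = 1; i < n; i++) {
            if (data[i] < min) min = data[i];
            if (data[i] > max) max = data[i];
        }
        double mid = 0.5 * (min + max);
        for (size_t i = 0; i < n; i++)
            data[i] -= mid;
    }
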
4.1.4 Further Shrinking the Compressed Size by Gzip

SZ further reduces the storage size by running the lossless compressor Gzip on the compressed byte stream produced by the above three steps. Note that since one snapshot often has many variables (or data arrays), we actually apply the Gzip step only once for all variables together, after each of them has been processed by the previous SZ steps. This is because we observe that performing Gzip in batch for all variables in a snapshot can sometimes improve the compression factor more prominently than performing Gzip on each variable separately, probably because of the similar patterns or repeated values across variables.

4.2 Analysis of Hard-to-Compress Data for SZ

In this section, we provide an in-depth analysis of why SZ can obtain high compression factors in some cases but suffers low compression factors in other cases. Although our experiments (as shown in Table 3) show that the compression ratio of SZ is significantly higher than that of other state-of-the-art compressors in many cases, SZ may not work effectively on some hard-to-compress data sets, such as BlastBS and Eddy.

TABLE 3
Compression Factor of SZ(w/S) versus SZ(w/oS)

Benchmark | SZ(w/S) | SZ(w/oS)
Blast2 | 100 | 110
Sedov | 8.45 | 7.44
BlastBS | 4.02 | 3.26
Eddy | 11.86 | 8.13
Vortex | 21.45 | 13.6
BrioWu | 87 | 71.2
GALLEX | 184 | 183.6
MacLaurin | 127.2 | 116
Orbit | 519 | 433
Shaf.Shock | 54 | 48
CICE | 6.56 | 5.43
ATM | 3.8 | 3.95

In fact, improving lossy compression for hard-to-compress data is much more difficult than the original design of a lossy compressor such as SZ. On the one hand, we need to thoroughly understand the hard-to-compress data before optimizing the compression factors for those data. On the other hand, we have to limit the compression/decompression cost to a low level, such that the compression/decompression time will not be increased significantly. In this sense, many of the advanced but time-consuming techniques, such as data sorting (adopted by [9]) and K-means clustering (used by [19]), cannot be used in our solution.

As described previously, the storage byte stream generated by SZ compression has two major parts: a bit array to denote the best-fit curve-fitting models for the predictable data and a stream of bytes representing the unpredictable data. The latter part can be further split into two subparts: an XOR-leading-zero part and a significant-bytes part that excludes the XOR-leading-zero bytes. That is, the compressed size (or the compression factor) is dominated by these three parts. Accordingly, we characterize such information under SZ based on the 13 benchmarks.

Fig. 3. Unpredictable ratio in SZ lossy compression.

Fig. 3 presents the unpredictable ratio (i.e., the ratio of the amount of unpredictable data to the total amount of data) during the SZ lossy compression on the first 11 benchmarks listed in Table 1. The results for the last two benchmarks in Table 1 will be shown later because they have too few snapshots to present clearly with the other benchmarks in one figure. Combining Fig. 3 and Table 3 can reveal the relationship between the unpredictable ratio and the compression factor. First of all, a high unpredictable ratio may lead to a low compression factor in most cases. For example, GALLEX's compression factor is up to 183.6, while its unpredictable ratio is only 3.5 percent; and Sedov's compression factor is 7.44, while its unpredictable ratio is in [90, 98 percent] on many time steps. Other benchmarks showing similar behavior include MacLaurin, Orbit, ShafranovShock, BlastBS, and Vortex. The key reason is that a high unpredictable ratio means that most of the data cannot be approximated by the best-fit curve-fitting model in the lossy compression. We also note that the unpredictable ratio based on our curve-fitting model may not always dominate the compression factor. The Blast2 benchmark, for instance, is a typical example that exhibits a very high compression factor (about 110) while its unpredictable ratio is 80+ percent under our curve-fitting prediction model. Such a high compression factor is due to the effective reduction of storage size in the lossy compression of unpredictable data (to be shown later). Specifically, when the XOR-leading-zero lengths of most unpredictable data are equal to or a little longer than a multiple of 8, the unpredictable data compression will work very well, in that most of them require only 2-bit XOR-leading-zero codes to represent their values.

Fig. 4. Average number of XOR-leading-zero bits per data point in SZ.

To understand the effectiveness of SZ's lossy compression on unpredictable data, we further characterize the average XOR-leading-zero length (i.e., the average number of XOR-leading-zero bits per data point compared with its preceding data point). Fig. 4 shows that the cases with both high compression factors and high unpredictable ratios generally have a relatively long XOR-leading-zero length. For instance, Blast2 has a relatively large number of XOR-leading-zero bits per data point during the lossy compression. Similarly, if the XOR-leading-zero length is relatively short, the compression factor will be significantly limited. A typical example is CICE. It has a relatively low unpredictable ratio, while it suffers from a low compression factor because of the fairly short XOR-leading-zero length (only about 10) on average. The key reason a long XOR-leading-zero length leads to a higher compression factor is that more XOR-leading-zero bits means more saving to gain on the left part of the floating-point numbers (note: the XOR-leading-zero part in the compressed stream is fixed to 2 bits in length for each data point).

Based on the above analysis, we can summarize two critical rules to obtain high compression factors with SZ.

- Rule 1: The unpredictable ratio should be limited.
- Rule 2: The XOR-leading-zero length should be maximized.
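
Because Rule 2 is central to the rest of the paper, a small C sketch of the XOR-leading-zero length from footnote 1 may help (our own illustration; the stored code in SZ records this length in whole bytes, i.e., count / 8):

    #include <stdint.h>
    #include <string.h>

    /* Number of identical leading bits shared by two consecutive doubles,
     * i.e., the number of leading zeros of the XOR of their IEEE 754 images. */
    static int xor_leading_zero_bits(double a, double b)
    {
        uint64_t ua, ub;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);
        uint64_t x = ua ^ ub;
        if (x == 0) return 64;
        int count = 0;
        for (uint64_t mask = (uint64_t)1 << 63; (x & mask) == 0; mask >>= 1)
            count++;
        return count;
    }
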
5 OPTIMIZATION OF LOSSY COMPRESSION BASED ON SZ COMPRESSION MODEL

To improve the compression of hard-to-compress data sets, we chose to use SZ as the basis of our research. Compared with SZ, our new design keeps only the curve-fitting models, in order to keep the high compression factor for easy-to-compress data. The compression technique for coping with the unpredictable data will be changed drastically.

As discussed in the last section, one idea (i.e., Rule 1) for improving the compression factor is to improve the prediction accuracy in the best-fit curve-fitting phase. To this end, one can devise more advanced prediction methods, so that more data values can be predicted more accurately. For this part, we proposed another technique in another piece of work, namely SZ(MD + Q) [12], which adopts multi-dimensional prediction plus an error-controlled quantization method to increase the data prediction accuracy. In this paper, we focus on the other part: how to improve the compression for hard-to-compress data based on Rule 2.

As for the hard-to-compress cases, we propose two critical solutions to improve the compression factors. First, we design an efficient method that can partition the whole data set into best-fit consecutive segments, such that the data compression in each segment can be performed more effectively because of the similar values in each segment. Second, we transform the data in each best-fit segment by an optimal shifting offset, such that the XOR-leading-zero lengths can be maximized. The new SZ version will be called SZ with best-fit segmentation and optimized shifting offset, abbreviated as SZ(w/S) in the following text.

5.1 Analysis of Middle-Value Based Data Normalization

Fig. 5. Middle value (a.k.a. median value) based data normalization.

As presented in Section 4.1, SZ performs a middle-value based data normalization, in order to transform all the data to close-to-zero values, because closer-to-zero numbers have fewer significant mantissa bits based on the user-specified compression error bounds. However, there are significant drawbacks in the middle-value based data normalization.

The first issue is that the fixed middle value may severely degrade the effectiveness of the data normalization, if the data to compress span a large value range and the data exhibit multiple spiky value changes throughout the data set. As shown in Fig. 5a, the snapshot from the BlastBS benchmark exhibits a very large value range [0.9592388, 1604.383093]. Its size is 1,603.4, such that the middle value is computed as 802. In this situation, the compression factor cannot be improved by data normalization but may even be degraded, because most of the original data were already close to zero before the data normalization.

The second issue is that the middle-value based data normalization may lead to an over-normalization problem. As shown in Fig. 5b, the size (0.007) of the value range [0.642, 0.649] for the BrioWu benchmark is small compared with the data values (around 0.645). In this case, subtracting the middle value 0.6455 from all the data will generate another set of data that are fairly close to zero. The transformed data will require fewer mantissa bits to meet the error-bound requirement, as expected. However, if all the data are extremely close to zero, the XOR-leading-zero length will likely be short as well, because two close-to-zero numbers will likely have different exponent parts in the IEEE 754 representation.

5.2 Optimization of Best-Fit Segmentation

We observe that the data set often exhibits multiple segments (as to be shown in Fig. 8); therefore, we partition the data set into different segments such that the data in each segment exhibit values closest to one another. This will lead to consistent XOR-leading-zero lengths for all the data points in the same segment.

Note that the compressor needs to keep the edge indices for the segments; thus the data partitioning will introduce extra storage bytes. Hence, how to optimize the partitioning and maximize compression factors becomes a challenging issue. In what follows, we propose a fast algorithm that can split the data set into the best-fit consecutive segments effectively.

The basic idea is to make the data in each segment tend to have the same exponent, such that their XOR-leading-zero lengths are close to each other. We partition the floating-point space into multiple intervals whose sizes increase exponentially, since the data values can span a large value range across different exponents. Specifically, the floating-point space is partitioned into the following intervals or groups (called exponent-partitioned intervals): ..., [−4, −2], [−2, −1], [−1, −0.5], [−0.5, −0.25], ..., 0, ..., [0.25, 0.5], [0.5, 1], [1, 2], [2, 4], .... We observe that each interval corresponds to a unique exponent number based on the IEEE 754 representation of the floating-point number. The exponent parts for the floating-point numbers in [0.5, 1], for example, are all −1, represented in binary as 01111111110 for double precision and 01111110 for single precision, respectively. Hence, we can simply extract the exponent part of each data value to check the interval it belongs to, which is a rapid computation in that this operation does not involve whole floating-point number parsing but only short-type integer parsing. The key idea of our solution is checking the exponent values for the data to compress and analyzing their changes in the sequence, in order to partition them into different segments. Specifically, as long as the exponent of the current data value changes across the edge of the exponent-partitioned interval compared with that of the last data point, we need to verify whether the amount of data collected is large enough for constructing a separate segment compared with the storage overhead (i.e., the extra storage size introduced by recording the segment information for the data). If the sign of a data value is changed compared with its preceding data points and if the length of the current segment is long enough, a candidate segment will also be generated; otherwise, the change of signs will be ignored.
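
A sketch of such an exponent extraction is shown below (our own illustration of the getExponent() helper used in the pseudo code; the actual implementation can read just the top bits of the number as a short integer, which is what makes the check so cheap):

    #include <stdint.h>
    #include <string.h>

    /* Unbiased IEEE 754 exponent of a double, taken directly from its bit image.
     * Every value in [0.5, 1), for example, yields -1. */
    static int get_exponent(double v)
    {
        uint64_t bits;
        memcpy(&bits, &v, sizeof bits);
        int biased = (int)((bits >> 52) & 0x7FF);   /* 11-bit exponent field */
        return biased - 1023;                       /* remove the bias */
    }
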
Fig. 6. Illustration of data partitioning.

To illustrate the basic idea of our data-partitioning method, we give an example with n data points to compress. As presented in Fig. 6, the data values span vertically different exponent-partitioned intervals throughout the data set. Once some data point's value (such as the data points i, j, k, p shown in the figure) goes across the edge of an exponent-partitioned interval compared with its preceding data value, the data index is recorded, because the collected data set (such as [0, i], [i, j], [j, k], [k, p]) may construct a separate segment. Note that the size of the candidate interval [k, p] is too small to obtain the gains of the storage-size saving against the storage overhead; hence, it should not be separated but should be merged with its adjacent segments [j, k] and [p, n]. As a result, the above n data points would be partitioned into three segments [0, i], [i, j], and [j, n], each of which would be handled separately by the compressor later. The proposed method can be performed rapidly because of the fast processing of the exponent-partitioned interval checking for each data point and the low theoretical time complexity O(N) (discussed in more detail later). Moreover, this method can be considered a best-fit solution because it is able to partition the data set precisely based on a best-fit segment-merging function (to be discussed later) over the exponent-partitioned intervals.

We present the pseudo code in Algorithm 1. All the segments to be generated are organized in a doubly-linked list with an empty segment as a header.

Algorithm 1. Fast Best-Fit Data Partitioning
Input: a sequence of data (denoted by X_0, X_1, ..., X_n), the minimum segment storage overhead threshold² (denoted by h), and the user-specified error bound (denoted by Δ).
Output: the best-fit partitioning (denoted by S = {ES_1, ES_2, ...}), where ES refers to a segment partitioned based on exponent-partitioned intervals.
1:  reqExpo ← getExponent(Δ).
2:  preExpo ← getExponent(X_0).
3:  for (i = 1, 2, ..., n−1) do
4:    curExpo ← getExponent(X_i).
5:    if (curExpo < reqExpo) then
6:      curExpo ← reqExpo.
7:    end if
8:    if (curExpo < preExpo) then
9:      preES ← createCandSeg(curExpo, i).
10:   else if (curExpo > preExpo) then
11:     Call backTrackParsing(preES, curExpo, h), and denote the latest settled segment by mergedES.
12:     if (preES.fixed & preES.level < curExpo) then
13:       preES ← createCandSeg(curExpo, i).
14:     else
15:       preES ← mergedES.
16:       preES.length++.
17:     end if
18:   else
19:     preES.length++.
20:   end if
21:   preExpo ← curExpo.
22: end for
23: Call backTrackParsing(preES, curExpo, h).
24: Clean the whole segment set S by merging any segment whose value range size is smaller than Δ.

2. The minimum segment storage overhead threshold is used to avoid generating segments in the data partitioning that are too small. Specifically, since we need to keep the segment's starting index (32 bits) and the segment length (32 bits) to maintain each segment, the threshold is set to 64 (in bits) in our design.

At the beginning (line 1) of the algorithm, the required exponent value (denoted by reqExpo) is computed based on the user-specified error bound (denoted by Δ), in order to determine the significant bits in the representation of the floating-point numbers. Specifically, reqExpo is equal to getExponent(Δ), where getExponent() is a function that extracts the exponent value from a floating-point number.

Next, the algorithm compares the exponent of each data value (denoted by curExpo) and that of its preceding data value (denoted by preExpo) throughout the whole sequence of the data. If the exponent of some data value is smaller than the reqExpo value (i.e., the data value itself is smaller than the user-required error bound), its exponent will be flushed to the value of reqExpo (lines 5–7), because reqExpo is the user-accepted exponent and thus can lead to more data being predictable by the curve-fitting. Then, the algorithm compares the values of curExpo and preExpo to determine whether the current data point index can be treated as a segment edge. Specifically, if curExpo is smaller than preExpo (such as at the data index i in Fig. 6), the current data index will be recorded by the algorithm, creating a candidate segment (denoted by the function createCandSeg() in the pseudo code). By contrast, if curExpo exhibits a greater value than preExpo (probably because of a sharp increase in the data value), then the algorithm will check whether the previously created candidate segments are long enough to be treated as separate segments or should be merged with other segments (lines 11–17 in the pseudo code). The details about this part are included in an iterative function, backTrackParsing(), to be described later. The preES refers to the preceding exponent-partitioned segment with respect to the current data point. The rest of the code (lines 8–21) updates the preES for checking the next data point; preES.fixed denotes whether the preES is already determined as a separate segment or not, and preES.level refers to the corresponding exponent value of the segment preES. The last step (line 24) of the algorithm checks each segment and removes the one whose value range size is smaller than the error bound, because the data in this segment are all supposed to be predictable.

The backTrackParsing function aims to remove candidate segments that are too short. In the example presented in Fig. 6, backTrackParsing() will be called at data points j, p, and n, respectively. Some of the candidate segments (such as [k, p] shown in Fig. 6) will be merged with their preceding segments because their sizes are too small compared with the segment storage overhead. The pseudo code of backTrackParsing is presented in Algorithm 2. It tries merging the current segment curES with its preceding segment by calling merge(preES, curES, nextLevel, h) iteratively.

Fig. 7. Illustration of segment merging.

The core of the backTrackParsing algorithm is the merge function, which is illustrated in Fig. 7. There are eight possible cases with regard to the different exponent levels of the preceding segment (l_p), current segment (l_c), and next segment (l_n). All eight cases can be split into three groups. In the first group, the segments exhibit a bump, with the highest level on the middle/current segment. In this situation, our merge function will simply return the current segment. In the second group, the best-fit merging method is right-merging, because otherwise an extra unnecessary higher level would be introduced, which might degrade the compression factor for the current segment in turn. Let us take case (d) as an example. Suppose that the current segment were merged with the left segment. Then the current level l_c would become l_p instead of l_n, leading to a larger value range for the lossy compression of the data in the current segment. This would, in turn, raise a larger deviation with respect to the current segment, introducing coarser compression granularity unexpectedly. Similarly, the best-fit segment merging method for all the cases in Group 3 is left-merging, which leads to the minimum exponent deviation for the data in the current segment.

Algorithm 2. BackTrackParsing Algorithm
Input: the last candidate segment created previously (denoted by curES), the exponent of the current data value (denoted by curExpo), and the segment-storage-overhead threshold (denoted by h).
Output: the previously marked candidate segments, checked as to whether they should be merged or not.
backTrackParsing(curES, curExpo, h)
1:  if (curES is fixed or curES is header) then
2:    return NULL.
3:  end if
4:  preES ← the preceding segment of curES.
5:  mergedES ← merge(preES, curES, nextLevel, h).
6:  nextLevel ← curES.level.
7:  preES ← the preceding segment of mergedES.
8:  latestES ← backTrackParsing(preES, nextLevel, h).
9:  if (latestES is NULL) then
10:   return mergedES's segment.
11: else
12:   return latestES's segment.
13: end if

The time complexity of our best-fit data-partitioning algorithm is O(N): the algorithm needs to go over all data points just once. In the iterative backTrackParsing algorithm, each of the previously collected candidate segments will also be checked only once. Also note that most of the operations work on short-type integers (i.e., exponent levels), which means fairly fast processing in practice.

Fig. 8. Demonstration of best-fit segmented median values.

Fig. 8 shows that our partitioning algorithm can effectively split the data set into consecutive segments. The two data sets from Vortex and BlastBS are partitioned into 63 segments and 39 segments, respectively, such that the data are all close to each other in every segment.

Fig. 9. Illustration of the optimized number of shifting bits.

5.3 Optimizing the Shifting Offset

Since the XOR-leading-zero part is generally denoted by a two-bit code [11], [14], which represents the number of XOR-leading-zero bytes, the extra XOR-leading-zero bits (i.e., the number of XOR-leading-zero bits mod 8) have to be stored exactly if the total XOR-leading-zero length is not a multiple of 8, leading to extra storage sizes. This analysis motivated us to introduce an offset onto each data point in every segment, such that the XOR-leading-zero lengths of the transformed data are equal to or slightly larger than a multiple of 8 (a byte's length in bits).

Specifically, we explore an optimal shifting offset value for every segment. All the data X_i are converted to X_i + offset. We derive the value of the offset in Theorem 1. The basic idea is to check the number of XOR-leading-zero bits (denoted by β) for the unpredictable data estimated by applying the best-fit curve-fitting models on the data set {X}. We expect to introduce an offset such that the numbers of XOR-leading-zero bits for the transformed data tend to be exactly equal to or slightly higher than multiples of 8 (a byte's length in bits). The reason is that such a situation can maximize the saving gains in the XOR-leading-zero part.

Theorem 1. The optimal shifting offset for a segment of data {X} is derived below:

    offset = 2^(getExponent(X̄) + α)    (1)

where X̄ denotes the mean data value of {X}, getExponent() returns the exponent value of a floating-point number, and

    α = 0,                           if avg(β mod 8) < 4;
    α = 8 − ⌊avg(β mod 8)⌋ + γ,      if avg(β mod 8) ≥ 4.

Here, avg(β mod 8) is the mean value of (β mod 8) over the unpredictable data, γ is a small increment in order to maximize the average number of XOR-leading-zero bytes for the normalized data, and ⌊·⌋ is the floor function.

Proof. As shown in Fig. 9, since β is the XOR-leading-zero length of a data point in the data set {X}, the expected extra number of XOR-leading-zero bits with respect to the last significant byte involved is equal to the average value of (β mod 8). Since the XOR-leading-zero length is counted in bytes (i.e., multiples of 8 bits), the expected number of bits to be shifted to the edge of an integer number of bytes is 8 − ⌊avg(β mod 8)⌋. Note that the distribution of the total number of XOR-leading-zero bits is supposed to follow a normal distribution based on the central limit theorem [28], especially when the amount of unpredictable data is fairly large. Hence, we must introduce an increment (denoted γ) to further shift the bits such that the expected XOR-leading-zero lengths for most of the data are in [8k, 8k + 4], where k is an integer. To this end, the optimized number of bits to shift (denoted by α) is set to 8 − ⌊avg(β mod 8)⌋ + γ when avg(β mod 8) ≥ 4. On the other hand, note that the whole data set has been partitioned into multiple segments, in each of which the data tend to have the same exponent. Thus, the expected transformed data value (such that the XOR-leading-zero lengths tend to be identical) is equal to X̄ + 2^(getExponent(X̄) + α), in that the increment 2^(getExponent(X̄) + α) will guarantee the transformed data values to have the same exponent and also shift the mantissa rightward by α bits uniformly. Since the expected value of the transformed data is supposed to be the above-derived one, i.e., X̄ + offset = X̄ + 2^(getExponent(X̄) + α), we get Equation (1).

We further illustrate the proof by using two single-precision numbers 0.001234 and 0.001278, whose IEEE 754 representations are 00111010 10100001 10111110 00101011 and 00111010 10100111 10000010 10010000, respectively. Obviously, the XOR-leading-zero length is 13 because their leftmost 13 bits are exactly the same. Therefore, we need to move the mantissa to the right by 3 (= 16 − 13) bits in order to get an integer number of bytes of XOR-leading-zero length. To this end, adding the increment 2^(getExponent(0.001256) + 3) to them will lead to the new transformed numbers 0.0090465 and 0.0090905, whose IEEE 754 representations are 00111100 00010100 00110111 11000101 and 00111100 00010100 11110000 01010010, respectively, with XOR-leading-zero lengths exactly equal to 16 bits.
length in bits). The reason is that such a situation can maxi- zero lengths right equal to 16 bits.
mize the saving gains in the XOR-leading-zero part. The parameter g should be lower than 4, in order to con-
trol the expected XOR-leading-zero lengths in [8k; 8k + 4].
Theorem 1. The optimal shifting offset for a segment of data {X} Its value is set to 1 in our experiments.
is derived below: We note that adding increments (i.e., shifting offsets)
onto the data may lead to more significant bits. To this end,
 ¼ 2getExponentðXÞþa (1) we must check each segment to see whether the total num-
ber of significant bits will exceed the bound of IEEE 754
where X denotes the mean data value of {X}, getExponent() is representation (64 for double precision and 32 for single
to return the exponent value of some floating-point number, precision). If yes, our compressor will not adopt the shifting
( offsets for that segment to guarantee error bound.
0; bmod 8 < 4
a¼  
8  b mod 8 þ g; bmod 8  4 5.4 Adaptive Lossy Compression
Based on the segmentation design and optimization of shift-
Here, bmod8 is the mean value of the (b mod 8) for unpredict-
ing offset for data transformation, the compression factor
able data, g is a small increment in order to maximize the aver-
can be improved significantly, as shown in Table 3. We
age XOR-leading-zero bytes for normalized data, and [ ] is
observe that the compression factor of our new solution SZ
floor function.
with bestfit Segmentation and optimized shifting offset for
Proof. As shown in Fig. 9, since b is the XOR-leading-zero XOR-leading-zero length (called SZ(w/S)) is higher than
length of the data point in the data set {X}, the expected that of our previous solution without the segmentation
extra number of XOR-leading-zero bits with respect to design and shifting offset optimization (abbreviated SZ(w/
the last significant byte involved is equal to the average oS)), by 20–72 percent in most cases (11 out of 13 bench-
value of (b mod 8). Since the XOR-leading-zero length is marks). However, SZ(w/S) may also lead to a slightly
counted by bytes (i.e., multiple of 8 bits), the expected degraded compression level in two benchmarks (9 percent
number of bits to be shifted
 to the edge of the integer in Blast2 and 3.8 percent in ATM), compared with SZ(w/
number of bytes is 8  b mod 8 . Note that the distribution oS)). Since Blast2 is an easy-to-compress case, the 9 percent
of the total number of XOR-leading-zero bits is supposed degradation on its compression factor will have little impact
to follow a normal distribution based on the central limit on the overall execution performance: less than 1 percent
theorem [28], especially when the amount of unpredict- performance difference as analyzed in Section 1. In contrast,
able data is fairly large. Hence, we must introduce an ATM is a typical hard-to-compress data, so the degradation
increment (denoted g) to further shift the bits such that of its compression factor may impact the execution perfor-
the expected XOR-leading-zero lengths for most of the mance to a certain extent. Based on our analysis, the key rea-
data are in [8k; 8k + 4], where k is an integer. To this end, son for its compression-ratio degradation is that the original

Fig. 10. Compression factors of variables in three snapshots (ATM).

data may already lead to proper XOR-leading-zero lengths. compressor, and we call them SZ(w/oS) [11] and SZ(MD +
That is, the extra segmentation and optimization of shifting Q) [12] respectively. The detailed experiment setting of the
offsets may not improve compression factors but may even parameters used by SZ(w/oS) is consistent with that of the
degrade them because of the inevitable overhead. We note corresponding paper [11]. SZ(MD + Q) [12] is a rather new
that this is a unique case where the SZ(w/oS) happens to be version based on SZ model, which improves the prediction
close enough to optimal. accuracy at the data prediction step such that the unpredict-
To improve the compression factor in all situations, we devised an adaptive compression method (namely, SZ(Ada)) by combining SZ(w/S) and SZ(w/oS), selecting the best-fit solution for each variable adaptively. Such a design is motivated by our observation that the compressors lead to very close compression factors on the same variables across short-distance snapshots. Fig. 10 presents the compression factors of 24 variables in three snapshots (time steps 15, 30, and 45) of the ATM benchmark. We observe that the compression factor does not differ significantly across snapshots for the same variable. Based on this analysis, our adaptive method SZ(Ada) performs either SZ(w/S) or SZ(w/oS) on the compression of each variable in every snapshot, and the best-fit compressor is checked periodically (every 20 snapshots in our implementation) and recorded in a bit-mask array: each bit represents either SZ(w/S) or SZ(w/oS) for a variable. Since the two solutions have similar compression/decompression times (to be shown later), the total compression/decompression time of SZ(Ada) increases only slightly because of the periodic best-fit compressor checking (e.g., only a 1/20 increment if the checking period is 20 snapshots). Such an adaptive design can significantly improve the compression factors, by up to 40 percent in hard-to-compress cases, while still guaranteeing the user-specified error bounds (shown in the next section).
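A minimal sketch of this per-variable adaptive selection is shown below. It is only an illustration under stated assumptions: compress_with_segments() and compress_without_segments() are hypothetical placeholders for SZ(w/S) and SZ(w/oS), and the bit-mask layout is one possible choice, not the actual SZ(Ada) code.

```c
#include <stddef.h>
#include <stdint.h>

#define CHECK_PERIOD 20  /* re-evaluate the best-fit compressor every 20 snapshots */

/* Hypothetical back-ends standing in for SZ(w/S) and SZ(w/oS);
 * each compresses one variable and returns the compressed size in bytes. */
size_t compress_with_segments(const double *data, size_t n, double err_bound, void *out);
size_t compress_without_segments(const double *data, size_t n, double err_bound, void *out);

/* One bit per variable: 1 -> use SZ(w/S), 0 -> use SZ(w/oS). */
static void set_bit(uint8_t *mask, int var, int use_ws)
{
    if (use_ws) mask[var >> 3] |=  (uint8_t)(1u << (var & 7));
    else        mask[var >> 3] &= ~(uint8_t)(1u << (var & 7));
}
static int get_bit(const uint8_t *mask, int var)
{
    return (mask[var >> 3] >> (var & 7)) & 1;
}

size_t compress_adaptive(int var, int snapshot, uint8_t *mask,
                         const double *data, size_t n, double err_bound, void *out)
{
    if (snapshot % CHECK_PERIOD == 0) {
        /* Periodic check: try both variants and record the winner in the bit mask. */
        size_t s_ws  = compress_with_segments(data, n, err_bound, out);
        size_t s_wos = compress_without_segments(data, n, err_bound, out);
        int use_ws = (s_ws < s_wos);
        set_bit(mask, var, use_ws);
        /* 'out' currently holds the w/oS result, so redo w/S if it won. */
        return use_ws ? compress_with_segments(data, n, err_bound, out) : s_wos;
    }
    /* Otherwise reuse the recorded best-fit compressor for this variable. */
    return get_bit(mask, var)
         ? compress_with_segments(data, n, err_bound, out)
         : compress_without_segments(data, n, err_bound, out);
}
```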
6 EVALUATION OF COMPRESSION QUALITY

We first describe the experimental setup used in the evaluation and then present the evaluation results by comparing our solution with nine other state-of-the-art compressors.

6.1 Experimental Setup
In our experiments, we compared our approach with other state-of-the-art compressors, including lossless compressors such as Gzip and FPC and lossy compressors such as SZ, ZFP (0.5.0), ISABELA, and Sasaki et al.'s approach (here referred to as SSEM, based on the authors' last names). A brief description of these compressors can be found in Section 7. There are two versions of the existing SZ compressor, and we call them SZ(w/oS) [11] and SZ(MD + Q) [12], respectively. The detailed experimental setting of the parameters used by SZ(w/oS) is consistent with that of the corresponding paper [11]. SZ(MD + Q) [12] is a rather new version based on the SZ model, which improves the accuracy of the data prediction step such that the unpredictable ratio can be reduced as much as possible. Specifically, it adopts multi-dimensional prediction instead of one-dimensional prediction, and it also adopts an error-controlled quantization method to encode the predicted values. As for SZ(MD + Q), we set the number of quantization bins to 128 for all the benchmarks except ATM, on which we set it to 65,536 in order to reach a high compression factor considering the overhead of storing the Huffman tree. If there are multiple variables in a snapshot, we perform data prediction and encoding on each variable and then perform Gzip compression for all variables together in this snapshot.

We evaluate the compression quality based on the 13 benchmarks listed in Table 1. The experimental setting for the 13 benchmarks can be found in Section 3. In our experiments, we adopt two important data-distortion metrics, the maximum compression error and the peak signal-to-noise ratio (PSNR), to evaluate the peak compression error and the overall compression error, respectively. PSNR is defined as follows:

PSNR = 20 * log10(value_range) - 10 * log10(MSE),    (2)

where value_range and MSE refer to the data value range and the mean squared compression error, respectively.
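Both metrics can be computed directly from the original and decompressed arrays. The sketch below follows Eq. (2), with value_range taken as the max-min span of the original data; it is only an illustrative helper, not the evaluation harness used for the reported results.

```c
#include <math.h>
#include <stddef.h>

/* PSNR as defined in Eq. (2): 20*log10(value_range) - 10*log10(MSE). */
double compute_psnr(const double *orig, const double *dec, size_t n)
{
    double min = orig[0], max = orig[0], mse = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (orig[i] < min) min = orig[i];
        if (orig[i] > max) max = orig[i];
        double diff = orig[i] - dec[i];
        mse += diff * diff;            /* accumulate squared error */
    }
    mse /= (double)n;
    double value_range = max - min;
    return 20.0 * log10(value_range) - 10.0 * log10(mse);
}
```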
TABLE 4
Compression Factors (the percentages in parentheses refer to the improvement compared with our previous work SZ(w/oS) [11])

Benchmark SZ(Ada) SZ(w/oS) SZ(MD+Q) ZFP ZFP+Gzip ISA ISA+Gzip SSEM^a FPZIP-40^b Gzip FPC^c
Blast2 138 (25%) 110 211.8 6.8 36.2 4.56 46.2 39.7 22.9 77 11.4
Sedov 8.75 (17.6%) 7.44 7.84 5.99 7.06 4.42 7.44 17^d 3.43 3.13 1.9
BlastBS 4.12 (26.4%) 3.26 4.0 3.65 3.78 4.43 5.06 8.45 2.43 1.24 1.29
Eddy 12.14 (49%) 8.13 11.87 8.96 9.53 4.34 5.18 N/A 2.56 5.5 3.89
Vortex 28.1 (107%) 13.6 21.29 10.9 12.2 4.43 4.72 12 3.35 2.23 2.34
BrioWu 104 (46%) 71.2 104.85 8.24 49.1 5 57.4 35.7 21.9 73 8.5
GALLEX 237 (29%) 183.6 255.1 36.7 92.7 4.89 33.6 82.4 20.35 34.7 11.37
MacLaurin 136 (17.2%) 116 110 21.77 31.4 4.1 5.47 7.44 3.84 2.03 2.08
Orbit 537 (24%) 433 433.2 85 157 4.96 8.43 11.7 3.9 1.8 1.86
ShafranovShock 54 (12.5%) 48 47 4.43 29.5 4.24 12.2 20.3 19.9 28 7.33
CICE 6.87 (26.5%) 5.43 6.71 5.23 5.54 4.19 4.46 3.83 2.3 2.6 2.67
ATM 4.27 (8.1%) 3.95 4.97 3.17 3.49 3.1 3.7 1.82 1.04 1.36 N/A

a. SSEM cannot work on Eddy because it requires each dimension to be of even size, whereas the Eddy data are 128 × 32 × 5 × 5 for each variable.
b. FPZIP-40 means that 40 bits out of the 64 bits are extracted and stored for each floating-point data point. (For ATM, FPZIP adopts FPZIP-30 instead, because FPZIP does not support a precision of 40 bits for that data.)
c. FPC cannot work on ATM because it does not support single-precision floating-point compression.
d. Note that SSEM does not respect the error bound, as confirmed in [11].

Fig. 11. Maximum compression errors of error-bounded lossy compressors (error bound is set to 10^-6).

Fig. 12. Rate distortion results of different compressors.
6.2 Experimental Results

6.2.1 Compression Factor
Table 4 presents the compression factors of 10 state-of-the-art compressors based on a total of 13 benchmarks (note that ISA, ISA + Gzip, and SSEM are not error-bounded compressors). As highlighted in the table, SZ(Ada) leads to the highest compression factors in most cases (8 out of 13 benchmarks). Its compression factor is even higher than those of the non-error-bounded compressors such as ISABELA and SSEM. In absolute terms, SZ(Ada) improves the compression factors by up to 107 percent over our previous work SZ(w/oS), and by up to 49 percent for hard-to-compress cases. The key reason SZ(Ada) can obtain such a significant improvement is that it can adaptively select SZ(w/S) per variable at runtime for the hard-to-compress data and choose SZ(w/oS) when SZ(w/S)'s segmentation overhead is relatively huge compared with the compressed size. We also note that SZ(MD + Q) works better than SZ(Ada) on the ATM data set. The reason is that SZ(MD + Q) adopts a multi-dimensional prediction method, which may significantly reduce the number of unpredictable data points. How to integrate the advantages of SZ(MD + Q) and SZ(Ada) will be included in our future work.

6.2.2 Compression Error
In Fig. 11, we present the maximum compression errors calculated after decompressing all the data for the three error-bounded compressors: SZ(Ada), SZ(w/oS), and ZFP. We clearly observe that the three lossy compressors are all able to restrict the compression errors within the error bound (10^-6 as set in our experiments). Note that SZ(w/oS) and ZFP both over-preserve the precision to varying degrees compared with the specified error bound. Specifically, the compression errors with SZ(w/oS) are within [2x10^-7, 10^-6] for the vast majority of data, and ZFP's compression errors are within [1x10^-7, 4x10^-7] for most of the data. In comparison, SZ(Ada)'s compression errors are about 9x10^-7 for the majority of data, which can explain why SZ(Ada) works better than the other two to a certain extent.
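The maximum compression error reported in Fig. 11 is simply the largest pointwise deviation between the original and decompressed values; a minimal check against a user-specified absolute error bound (e.g., 10^-6) could look like the following sketch, which is illustrative only.

```c
#include <math.h>
#include <stddef.h>

/* Returns the maximum pointwise absolute error; a compressor respects the
 * (absolute) error bound if the returned value never exceeds that bound. */
double max_compression_error(const double *orig, const double *dec, size_t n)
{
    double max_err = 0.0;
    for (size_t i = 0; i < n; i++) {
        double e = fabs(orig[i] - dec[i]);
        if (e > max_err)
            max_err = e;
    }
    return max_err;
}
```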
TABLE 5
Compression Performance (in MB/s)

Benchmark ISA. ZFP ZFP+G SZ(w/S) SZ(MD+Q) SZ(A) Gzip
Blast2 6.1 98.4 52.8 90.5 140.5 88.4 107
Sedov 5.74 71 45.7 49.3 52.4 47.8 20.2
BlastBS 13.44 57.9 35.4 48 50 43 11.7
Eddy 5.73 47.1 25.7 45.8 60.7 45.6 53.7
Vortex 5.37 67.4 38.4 78.4 80.6 73.4 31.5
BrioWu 8.33 114.6 69.2 82.1 97.3 79.1 50.9
GALLEX 8.7 270 193 119 138.5 108 51.4
MacLau. 7.1 225 175 192.1 211.4 182.6 24.32
Orbit 8 217.1 168.9 200 226.9 183.1 14.9
Shaf.Sh. 6.42 50.2 23.7 43.9 47.3 41.7 24
CICE 4.7 52.7 29.4 49.6 52.6 47.4 54.9
ATM N/A 58.6 24.4 38.4 41.4 36.6 22.9

TABLE 6
Decompression Performance (in MB/s)

Benchmark ISA. ZFP ZFP+G SZ(w/S) SZ(MD+Q) SZ(A) Gzip
Blast2 22.5 105 88.4 124.9 140.5 117.5 145.7
Sedov 22.5 71 64.1 103.1 117.8 101.5 65
BlastBS 22.4 170 129.5 234.3 259 224 81
Eddy 23 46.9 41.8 44.6 48.8 43.4 44.3
Vortex 23.3 60.4 54.4 59.8 74.4 59.2 55.2
BrioWu 24 102.8 97.3 97.3 116.4 94.8 73.3
GALLEX 24.1 245.5 225 245.5 270 245.5 36
MacLau. 22.7 120 111.5 112.5 121.2 110.5 64.3
Orbit 23 217.1 203 205 211.1 200 20.3
Shaf.Sh. 19.4 43.9 39.7 46.7 49.4 43.9 31.5
CICE 22.8 65.8 38 68.3 75.2 65.4 67.3
ATM N/A 54.7 47.6 192.6 156.6 150.2 216.45
6.2.3 Rate Distortion
In Fig. 12, we present the rate-distortion results of five different compression techniques, including SZ(adaptive), SZ(with segments), SZ(MD + Q), ZFP [15], and ZFP + Gzip, for three typical benchmarks (due to the space limit of the paper). These three benchmarks are representative of different research domains (Sedov is a shock simulation, CICE is a climate simulation, and HACC is a cosmology simulation). As for rate distortion, the rate (also known as bit-rate) refers to the number of bits used to represent a data point on average after compression (the smaller the better). Distortion is assessed using the peak signal-to-noise ratio, which is a common criterion to assess the overall compression error (the higher the better).
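Bit-rate in this sense is just the average number of bits spent per data point after compression; the one-line computation below (with hypothetical argument names) makes the definition explicit. For 64-bit input values, the compression factor equals 64 divided by the bit-rate.

```c
#include <stddef.h>

/* Bit-rate = total compressed bits / number of data points (lower is better). */
double bit_rate(size_t compressed_bytes, size_t num_points)
{
    return (double)compressed_bytes * 8.0 / (double)num_points;
}
```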
Based on the three figures, we note that SZ(ada) leads to the sion rate as shown in Table 5, because of its larger com-
best results with respect to the first two cases, and its bit-rate pressed size after its original compression.
is less than the second best compressor SZ(w/o segment) by The decompression performance is presented in Table 6.
10 percent and 50 percent on CICE and Sedov respectively. Similar to the compression performance, ISABELA suffers
The reason is three-fold: (1) we adapt an adaptive solution the lowest decompression rate (i.e., higher decompression
that selects the best-fit options for different variables dynami- time). We also observe that for all the benchmarks, SZ(w/
cally; (2) we optimize the unpredictable data compression S)’s decompression performance is close to that of ZFP in
using segmented offset-shifting method, which can improve most cases. The key reason that SZ(w/S) works fast on
the compression factor for hard-to-compress cases in particu- decompression is that it just needs to decode the best-fit
lar; (3) the Gzip step is performed on all variables (10 variables curve-fitting type and rebuilds the unpredictable data by
in Sedov and 5 variables in CICE) after each variable data is bitwise operations. SZ(Ada)’s decompression performance
separately processed with the previous SZ compression steps is close to that of SZ(w/S) because it adopts either SZ(w/
(including predictable data compression and unpredictable oS) or SZ(w/S) for each variable adaptively.
data compression). Since we have only one snapshot of We compare the performance of processing the cosmol-
HACC data set, we cannot evaluate SZ(Ada) in this case. ogy simulation data [1] with our compressor against the I/
Fig. 12c shows that SZ(w/ segment) has the similar rate- O performance without the compressor, as shown in Table 7.
distortion result with ZFP+Gzip, and it is less than the original We emulate the course of the in-situ compression at run-
ZFP compressor by 2 3 bits per data point. The reason ZFP time, by splitting the cosmology data into multiple pieces
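The compression rates in Table 5 are plain throughput numbers: the original data size divided by the elapsed compression time. A sketch of such a measurement is shown below; compress_buffer() is a hypothetical stand-in for whichever compressor is being timed.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical compressor entry point; returns the compressed size in bytes. */
size_t compress_buffer(const double *data, size_t n, double err_bound, void *out);

/* Throughput in MB/s for compressing n doubles. */
double compression_rate_mbs(const double *data, size_t n, double err_bound, void *out)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    (void)compress_buffer(data, n, err_bound, out);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb  = (double)(n * sizeof(double)) / (1024.0 * 1024.0);
    return mb / sec;
}
```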
The decompression performance is presented in Table 6. Similar to the compression performance, ISABELA suffers the lowest decompression rate (i.e., the highest decompression time). We also observe that, for all the benchmarks, SZ(w/S)'s decompression performance is close to that of ZFP in most cases. The key reason that SZ(w/S) works fast on decompression is that it just needs to decode the best-fit curve-fitting type and rebuild the unpredictable data by bitwise operations. SZ(Ada)'s decompression performance is close to that of SZ(w/S) because it adopts either SZ(w/oS) or SZ(w/S) for each variable adaptively.

TABLE 7
Parallel Processing Time (in seconds): cmpres refers to compression, wr_cmpres_data refers to writing the compressed data, rd_cmpres_data refers to reading the compressed data, and decmpres refers to decompression

#cores cmpres wr_cmpres_data sum rd_cmpres_data decmpres sum
1 126.8 0.8 127.6 35.3 1.2 36.5
2 65 0.62 65.62 18.16 0.6 18.76
4 34 0.63 34.63 9.52 0.9 10.42
8 18.38 0.8 19.18 5.23 0.69 5.92
16 9.38 0.82 10.2 2.62 0.35 2.97
32 4.79 0.7 5.49 1.36 0.26 1.62
64 2.46 0.8 3.26 0.73 0.15 0.88
128 1.27 0.8 2.07 0.36 0.1 0.46
256 0.86 0.79 1.65 0.21 0.07 0.28
512 0.44 0.8 1.24 0.1 0.04 0.14
1024 0.3 0.5 0.8 0.074 0.02 0.094
We compare the performance of processing the cosmology simulation data [1] with our compressor against the I/O performance without the compressor, as shown in Table 7. We emulate the course of in-situ compression at runtime by splitting the cosmology data into multiple pieces and performing the compression in parallel by different ranks under an MPI program before storing the data into the PFS. The simulation scale ranges from 1 core through 1024 cores on the Argonne Blues cluster [13]. The writing time and reading time of the original data set (3.5 GB) through the parallel file system are, respectively, 4.9 seconds and 4.1 seconds on average based on our experiments. Based on Table 7, we can see that the compression time and decompression time both decrease linearly with the number of cores. When the running scale is increased to 64 cores, the total overhead of writing the data (i.e., compression time + writing time = 3.26 seconds) already becomes much lower than the time of writing the original data set (4.9 seconds). When the parallel scale of the simulation is up to 1024 cores, the overhead of writing the data is down to only 1/5 of the time of writing the original data set. The data reading overhead is less than 1/40 (0.094 seconds versus 4.1 seconds) of the time of reading the original data set, which is a significant improvement for the simulation performance at runtime. The key reason for the high performance gain with respect to the reduction of the data writing/reading overhead is two-fold: on the one hand, the compression/decompression time decreases linearly with the increasing number of cores because there is no communication cost among the ranks; on the other hand, the compressed size is much smaller than the original data size, leading to a much lower I/O time cost. With a value_range-based relative error bound of 1E-4, the compression factor under our compressor is 2.73, compared with 1.48 under ZFP 0.5.0 and 1.2 under Gzip.
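The emulation described above boils down to each MPI rank compressing its own piece of the data set and writing its compressed block to the parallel file system independently. The sketch below illustrates that pattern; it is not the actual benchmark driver, and compress_buffer(), the per-rank file naming, the error-bound value, and the 3.5 GB sizing are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical compressor entry point; returns the compressed size in bytes. */
size_t compress_buffer(const double *data, size_t n, double err_bound, void *out);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank owns one piece of the split data set (~3.5 GB total, assumed). */
    size_t total = 3500000000ULL / sizeof(double);
    size_t n = total / (size_t)nranks;
    double *piece = malloc(n * sizeof(double));
    void   *out   = malloc(n * sizeof(double));   /* worst-case output buffer */
    for (size_t i = 0; i < n; i++)
        piece[i] = (double)i;                     /* stand-in data; real loading omitted */

    double t0 = MPI_Wtime();
    size_t csize = compress_buffer(piece, n, 1e-4, out);
    double t_compress = MPI_Wtime() - t0;

    /* Each rank writes its compressed block to the PFS independently. */
    char fname[64];
    snprintf(fname, sizeof(fname), "compressed.%04d.sz", rank);
    t0 = MPI_Wtime();
    FILE *fp = fopen(fname, "wb");
    fwrite(out, 1, csize, fp);
    fclose(fp);
    double t_write = MPI_Wtime() - t0;

    if (rank == 0)
        printf("compression %.2f s, write %.2f s\n", t_compress, t_write);

    free(piece);
    free(out);
    MPI_Finalize();
    return 0;
}
```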
7 RELATED WORK

HPC data compressors can be split into two categories: lossless compressors [7], [14], [16] and lossy compressors [8], [9], [11], [12], [15], [19]. Lossless compressors can be further split into general data compressors and floating-point data compressors. The former can compress any type of data stream, including video streams. A typical example is Gzip [7], which integrates the LZ77 [29] algorithm and Huffman encoding [30]. The LZ77 algorithm makes use of a sliding window to search for repeated sequences in the data and replaces them with references to a single copy appearing earlier in the data stream. Huffman encoding [30] is an entropy-based lossless compression scheme that assigns each symbol in the data stream a unique prefix-free code. Floating-point data compressors compress a set of floating-point numbers by analyzing the IEEE 754 binary representations of the data one by one. Typical examples include FPC [14] and Fpzip [16], which leverage finite context models and predictive coding of floating-point data, respectively. The common issue of such lossless compression methods is the relatively low compression ratio, which significantly limits the performance of runtime data processing or post-processing, especially for exascale scientific simulations.

In recent years, many lossy compressors have been proposed to significantly reduce the data reading/writing cost for large-scale HPC applications. Existing state-of-the-art compressors often combine multiple strategies, such as vector quantization (VQ), orthogonal transform, curve-fitting approximation (CFA), analysis of the floating-point binary representation (BA), and Gzip lossless compression (Gzip). NUMARCK, for example, approximates the differences between snapshots by vector quantization. ISABELA converts the multidimensional data to a sorted data series and then performs B-spline interpolation. ZFP involves more complicated techniques such as fixed-point integer conversion, block transform, and binary representation analysis with bit-plane encoding. Fpzip adopts predictive coding and also ignores insignificant bit planes in the mantissa based on the analysis of the IEEE 754 binary representation. SSEM splits data into a high-frequency part and a low-frequency part by wavelet transform and then uses vector quantization and Gzip. SZ is an error-bounded lossy compressor proposed in [11]; it comprises four compression steps, as described in Section 4.1. In addition, we recently improved the prediction accuracy by adopting a multidimensional prediction and error-controlled quantization model [12]. We compare the compression techniques proposed in this paper to that approach as well, and we observe that the new solution outperforms it in most cases, especially on the rate-distortion metric.

We presented in the preceding section the evaluation results by comparing our solution with all of the available compressors,^3 using 13 applications across different scientific domains. Our new solution leads to significantly higher compression factors with comparable compression/decompression times, and it also guarantees the user-specified error bound.

3. We did not include NUMARCK because of three factors: (1) its code is unavailable to download; (2) it does not respect the error bound, as shown in [11]; and (3) its compression ratio is not competitive with the others [11].

In addition to the mesh-data-based compressors, there are some other lossy compressors tailored for particular scientific simulations: particle data compression related to molecular dynamics research or cosmology simulation is one typical example. In this type of simulation, a very large number of particles are simulated/analyzed, and the key information includes both the position and the velocity of each particle in three dimensions. How to compress the particle data effectively has been studied for years [31], [32], [33], [34]. However, almost all of the related compressors are designed based on the trajectory analysis of individual particles along time steps, which requires the users to load/keep multiple snapshots during the compression. This is impractical when the number of particles is extremely large, because of the limited memory capacity. By contrast, our compressor allows the snapshots to be compressed separately, which is very critical for large-scale particle simulations that require in-situ compression at runtime.

8 CONCLUSION AND FUTURE WORK

In this paper, we present a novel error-bounded HPC floating-point data compressor. We propose an optimized algorithm that can adaptively partition the data into a set of best-fit consecutive segments and also optimize the shifting offset for the data transformation such that XOR-leading-zero lengths can be maximized. Our compressor supports C and Fortran, and it can be downloaded under a BSD license. Key findings are threefold:
- Its compression factor range is [2.82, 538], which is higher than that of many related lossy compressors in most cases, based on our experiments with 10+ benchmarks across multiple research domains.
- The compression errors are always strictly limited within the user-specified error bound.
- Its compression/decompression performance is comparable to those of the other techniques.

In future work, we plan to further explore new ideas to improve the compression factors, e.g., by combining SZ(MD + Q) and the techniques proposed in this paper. We also plan to study the relationship between the compression factor and the error bound, and to support common HPC data formats such as netCDF and HDF5.

ACKNOWLEDGMENTS
This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations - the Office of Science and the National Nuclear Security Administration - responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation's exascale computing imperative. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (Argonne). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357.

REFERENCES
[1] S. Habib, V. Morozov, N. Frontiere, H. Finkel, A. Pope, and K. Heitmann, "HACC: Extreme scaling and performance across diverse architectures," in Proc. Int. Conf. High Performance Comput. Netw. Storage Anal., 2013, pp. 1-10.
[2] Community Earth Simulation Model (CESM). [Online]. Available: https://www2.cesm.ucar.edu/
[3] A. H. Baker, H. Xu, J. M. Dennis, M. N. Levy, D. Nychka, and S. A. Mickelson, "A methodology for evaluating the impact of data compression on climate simulation data," in Proc. ACM 23rd Int. Symp. High-Performance Parallel Distrib. Comput., 2014, pp. 203-214.
[4] K. Paul, S. Mickelson, J. M. Dennis, H. Xu, and D. Brown, "Lightweight parallel python tools for earth system modeling workflows," in Proc. IEEE Int. Conf. Big Data, 2015, pp. 1985-1994.
[5] Earth System Grid (ESG). [Online]. Available: https://www.earthsystemgrid.org/home.htm
[6] A. H. Baker, et al., "Evaluating lossy data compression on climate simulation data within a large ensemble," J. Geoscientific Model Develop. Discussions, vol. 2016, pp. 1-38, 2016.
[7] Gzip compression. [Online]. Available: http://www.gzip.org
[8] N. Sasaki, K. Sato, T. Endo, and S. Matsuoka, "Exploration of lossy compression for application-level checkpoint/restart," in Proc. IEEE 29th Int. Parallel Distrib. Process. Symp., 2015, pp. 914-922.
[9] S. Lakshminarasimhan, et al., "Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data," in Proc. 17th Euro-Par, 2011, pp. 366-379.
[10] Argonne MIRA system. [Online]. Available: http://www.alcf.anl.gov/mira
[11] S. Di and F. Cappello, "Fast error-bounded lossy HPC data compression with SZ," in Proc. IEEE 30th Int. Parallel Distrib. Process. Symp., 2016, pp. 730-739.
[12] D. Tao, S. Di, Z. Chen, and F. Cappello, "Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization," in Proc. IEEE 31st Int. Parallel Distrib. Process. Symp., 2017, pp. 1129-1139.
[13] Blues Cluster. [Online]. Available: http://www.lcrc.anl.gov/
[14] M. Burtscher and P. Ratanaworabhan, "High throughput compression of double-precision floating-point data," in Proc. Data Compression Conf., 2007, pp. 293-302.
[15] P. Lindstrom, "Fixed-rate compressed floating-point arrays," IEEE Trans. Vis. Comput. Graph., vol. 20, no. 12, pp. 2674-2683, Dec. 2014.
[16] P. Lindstrom and M. Isenburg, "Fast and efficient compression of floating-point data," IEEE Trans. Vis. Comput. Graph., vol. 12, no. 5, pp. 1245-1250, Sep.-Oct. 2006.
[17] ASCF Center, "FLASH user's guide (version 4.2)," 2014. [Online]. Available: http://flash.uchicago.edu/site/flashcode/user_support/flash4_ug_4p3.pdf
[18] P. Fisher, "Nek5000 user guide," 2010. [Online]. Available: http://www.mcs.anl.gov/ fischer/nek5000/examples.pdf
[19] Z. Chen, S. W. Son, W. Hendrix, A. Agrawal, W. Liao, and A. Choudhary, "NUMARCK: Machine learning algorithm for resiliency and checkpointing," in Proc. IEEE/ACM Supercomputing Int. Conf. High Performance Comput. Netw. Storage Anal., 2014, pp. 733-744.
[20] P. Colella and P. R. Woodward, "The piecewise parabolic method (PPM) for gas-dynamical simulations," J. Comput. Physics, vol. 54, pp. 174-201, 1984.
[21] L. I. Sedov, Similarity and Dimensional Methods in Mechanics (10th ed.). New York, NY, USA: Academic Press, 1959.
[22] A. L. Zachary, A. Malagoli, and P. Colella, "A higher-order Godunov method for multidimensional ideal magnetohydrodynamics," SIAM J. Scientific Comput., vol. 15, no. 2, pp. 263-284, 1994.
[23] O. Walsh, "Eddy solutions of the Navier-Stokes equations," in Proc. Navier-Stokes Equations II - Theory and Numerical Methods, 1991, pp. 306-309.
[24] M. Brio and C. C. Wu, "An upwind differencing scheme for the equations of ideal magnetohydrodynamics," J. Comput. Physics, vol. 75, pp. 400-422, 1988.
[25] A. Obabko, "Simulation of gallium experiment," 2005. [Online]. Available: http://www.cmso.info/ cmsopdf/princeton5oct05/talks/Obabko-05.ppt
[26] V. D. Shafranov, "The structure of shock waves in a plasma," Sov. Phys. JETP, vol. 5, 1957, Art. no. 1183.
[27] D. Bailey, et al., "Community Ice CodE (CICE) user's guide (version 4.0)." [Online]. Available: http://www.cesm.ucar.edu/models/ccsm4.0/cice/ice_usrdoc.pdf
[28] J. Rice, Mathematical Statistics and Data Analysis (2nd ed.). Pacific Grove, CA, USA: Duxbury Press, 1995.
[29] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[30] D. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, no. 9, pp. 1098-1101, Sep. 1952.
[31] H. Ohtani, K. Hagita, A. M. Ito, T. Kato, T. Saitoh, and T. Takeda, "Irreversible data compression concepts with polynomial fitting in time-order of particle trajectory for visualization of huge particle system," J. Physics: Conf. Series, vol. 45, no. 1, pp. 1-11.
[32] D. Y. Yang, A. Grama, and V. Sarin, "Bounded-error compression of particle data from hierarchical approximate methods," in Proc. IEEE/ACM Supercomput. Int. Conf. High Performance Comput. Netw. Storage Anal., 1999, Art. no. 32.
[33] K. Hagita, T. Takeda, T. Kato, H. Ohtani, and S. Ishiguro, "Efficient data compression of time series of particles' positions for high-throughput animated visualization," in Proc. IEEE/ACM Supercomput. Int. Conf. High Performance Comput. Netw. Storage Anal., 2013, pp. 1-2.
[34] A. Kumar, X. Zhu, Y. Tu, and S. Pandit, "Compression in molecular simulation datasets," in Proc. Int. Conf. Intell. Sci. Big Data Eng., 2013, pp. 22-29.
Sheng Di received the master's degree from the Huazhong University of Science and Technology in 2007 and the PhD degree from the University of Hong Kong in 2011. He is currently an assistant computer scientist with Argonne National Laboratory. His research interests involve resilience for high-performance computing (such as silent data corruption, optimization of checkpoint models, and in-situ data compression) and broad research topics in cloud computing (including optimization of resource allocation, cloud network topology, and prediction of cloud workload/hostload). He is working on multiple HPC projects, such as detection of silent data corruption, characterization of failures and faults for HPC systems, and optimization of multilevel checkpoint models. Contact him at [email protected]. He is a member of the IEEE.

Franck Cappello is a program manager and senior computer scientist at ANL. Before moving to ANL, he held a joint position at Inria and the University of Illinois at Urbana-Champaign, where he initiated and co-directed from 2009 the Inria-Illinois-ANL Joint Laboratory on Petascale Computing. Until 2008, he led a team at Inria, where he initiated the XtremWeb (Desktop Grid) and MPICH-V (fault-tolerant MPI) projects. From 2003 to 2008, he initiated and directed the Grid5000 project, a nationwide computer science platform for research in large-scale distributed systems. He has authored papers in the domains of fault tolerance, high-performance computing, and Grids, and has contributed to more than 70 program committees. He is an editorial board member of the IEEE Transactions on Parallel and Distributed Systems, the International Journal on Grid Computing, the Journal of Grid and Utility Computing, and the Journal of Cluster Computing. He is/was program co-chair of IEEE CCGRID 2017, award chair of ACM/IEEE SC15, program co-chair of ACM HPDC 2014, test-of-time award chair of IEEE/ACM SC13, tutorial co-chair of IEEE/ACM SC12, technical papers co-chair of IEEE/ACM SC11, program chair of HiPC 2011, program co-chair of IEEE CCGRID 2009, program area co-chair of IEEE/ACM SC09, and general chair of IEEE HPDC 2006. He is a fellow of the IEEE and a member of the ACM. Contact him at [email protected].

" For more information on this or any other computing topic,


please visit our Digital Library at www.computer.org/publications/dlib.

You might also like