A Comparative Analysis in Apache IoTDB
Not only the vast applications but also the distinct features of time series data stimulate the booming growth of time series database management systems, such as Apache IoTDB, InfluxDB, OpenTSDB and so on. Almost all these systems employ columnar storage, with effective encoding of time series data. Given the distinct features of various time series data, it is not surprising that different encoding strategies may perform variously. In this study, we first summarize the features of time series data that may affect encoding …

Figure 1: Example sensor time series
In this paper, we present a comparative analysis of time series data encoding techniques in Apache IoTDB, an open-source time series database developed in our preliminary studies [43]. Our major contributions are summarized as follows.

(1) We summarize several time series data features that may affect the performance of encoding in Section 2. Intuitively, as illustrated in Figure 1, the scale of values is obviously an important factor of storage. Likewise, when storing the delta between two consecutive values, the scale of deltas becomes a key issue. The numbers of value repeats and increases are also essential to some encoding ideas.

(2) We present a qualitative analysis of encoding effectiveness regarding various data features in Section 4. While there is no winner on all the data features, TS_2DIFF performs well in a number of cases. For the cases where TS_2DIFF shows worse results, such as repeat rate 1 in Figure 17, it may be less frequent and not that significant in practice.

Table 1: Numerical data features

Category  Notation    Feature
Scale     Mean(TS)    Mean of values
Scale     Var(TS)     Variance of values
Scale     Spread(TS)  Maximum minus minimum of values
Delta     Mean(DS)    Mean of deltas
Delta     Var(DS)     Variance of deltas
Delta     Spread(DS)  Maximum minus minimum of deltas
Repeat    Count(RS)   Count of consecutive repeats
Increase  Count(IS)   Count of increases
(3) We devise a benchmark for time series data encoding. It consists of (a) data generators for simulating various data features, (b) several real-world datasets, public or collected by our industrial partners, and (c) metrics such as compression ratio (space cost after encoding and compressing divided by the original space cost). In particular, multiple features can vary at the same time in the generator, such as large values but small deltas, so that the distinct cases favored by different algorithms can be illustrated. Moreover, different data types are supported, including INT32, INT64, FLOAT and DOUBLE numerical values, as well as text values.

(4) We conduct an extensive experimental evaluation in Section 7. The quantitative analysis generally verifies the aforesaid qualitative analysis of encoding performance regarding various data features.

Finally, we also discuss some related work in Section 8, and outline some future directions in Section 9 referring to the analysis. The source code of the encoding algorithms has been deployed in the GitHub repository of Apache IoTDB [5]. The experiment related code and data are available in [6].

2 NUMERICAL DATA FEATURES
We select several features that may affect the performance of time series encoding. As illustrated in Section 4, we have three types of lossless encoding algorithms, RLE-based, Diff-based and hybrid. For all the algorithms, scale is an important feature, since techniques like bit-packing are widely used. Diff-based algorithms favor small changes in consecutive values, and thus the delta feature is considered. RLE-based algorithms handle repeating contents in consecutive values, i.e., the repeat feature. Finally, we notice that the sign bit of a negative delta also affects the performance of bit compression, leading to the increase feature of consecutive values.

In addition to the aforesaid four features, there do exist others for consideration. For instance, the signal-to-noise ratio (SNR) could be considered in frequency-domain-based compression [46], which however is lossy and out of the scope of this study. Moreover, besides the numerical values, we further introduce two other features on value and character for the text data type in Section 3.

Table 1 outlines the major features. For simplicity, we use TS = [v_1, v_2, ..., v_n] to denote the value list of a time series. Most of these features can be directly calculated in Apache IoTDB by the data profiling tools developed in our previous study [7].

2.1 Scale
The scale of data is one of the most important factors in storage. In general, the larger the values are, the more bits we need to encode them. As illustrated in Section 4 below, the run-length based algorithms [27] need to store the header, where more bits are needed for larger values. Bit-packing algorithms [35] are similarly affected. Besides, when most values are negative, bit-packing based algorithms perform badly since the sign bits are 1. To this end, we employ the mean, variance and spread (maximum minus minimum) of the values in time series TS, denoted by Mean(TS), Var(TS), and Spread(TS), to represent the scale features.

2.2 Delta
The delta features show the amplitude of data fluctuations, particularly important to time series. Let DS = [v_2 − v_1, v_3 − v_2, ..., v_n − v_{n−1}] denote the delta series of the time series TS, measuring the deltas of time-adjacent values. The differential-based algorithms [40] as introduced in Section 4.1 store these deltas. In this sense, we use Mean(DS), Var(DS), and Spread(DS), the mean, variance, and spread (maximum minus minimum) of deltas, to evaluate how large the deltas could be. It is worth noting that to some extent, Var(TS) also reflects the delta features, and likewise, Var(DS) reflects the delta of deltas, important to some encodings such as TS_2DIFF discussed in Section 4.1.

2.3 Repeat
Repetitive values are widely observed in time series, such as unchanged temperature readings over several minutes. Such consecutive repeats can be compressed by the run-length based algorithms [27] in Section 4.2. They also output zero values in the XOR operators introduced in Section 4.1, shrinking the space efficiently. To this end, we introduce a method to describe the repeats of a time series. The main idea is to count the number of consecutively repeated values which are in the interval of consecutive repetitive values. We define the repeat count series RS = [r_1, r_2, ..., r_n] of time series TS, having

    r_i = r_{i−1} + 1,  if v_i = v_{i−1};
    r_i = 1,            otherwise (v_i ≠ v_{i−1}),      (1)

for 1 < i ≤ n and r_1 = 1. Algorithms like SPRINTZ [20] in Section 4.3 have a block size of 8 for bit-packing numbers into integer bytes.
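For illustration, the features above can be computed from a value list with a few lines of Python. This is a sketch only, not the IoTDB data profiling tool of [7]; in particular, counting every position with v_i = v_{i−1} as one "consecutive repeat" is our simplifying assumption for Count(RS):

```python
def numerical_features(ts):
    """Compute Table 1 style features of a value list TS = [v1, ..., vn]."""
    n = len(ts)
    mean = sum(ts) / n
    var = sum((v - mean) ** 2 for v in ts) / n
    spread = max(ts) - min(ts)
    ds = [ts[i] - ts[i - 1] for i in range(1, n)]        # delta series DS
    # Repeat count series RS of Eq. (1): r1 = 1, grows by 1 on each repeat.
    rs = [1]
    for i in range(1, n):
        rs.append(rs[-1] + 1 if ts[i] == ts[i - 1] else 1)
    count_rs = sum(1 for i in range(1, n) if ts[i] == ts[i - 1])
    count_is = sum(1 for i in range(1, n) if ts[i] > ts[i - 1])  # Eq. (2)
    return mean, var, spread, ds, rs, count_rs, count_is
```

For the toy series [1, 1, 1, 4, 3], the delta series is [0, 0, 3, −1] and the repeat count series is [1, 2, 3, 1, 1], following Eq. (1).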
2.4 Increase
While repetitive values have difference 0, the sign of the difference for non-repetitive values is also of concern. The reason is that the non-zero sign bits may interfere with encoding in algorithms like RLBE [41] introduced in Section 4.3. If all the difference signs are positive, in other words, the time series values are always increasing, the encoding performs better. In contrast, when the differential value is negative, i.e., decreasing, the encoding performance would be bad. In this sense, we define Count(IS), the number of increasing values with adjacent timestamps,

    Count(IS) = |{v_i | v_i > v_{i−1}, 1 < i ≤ n}|.      (2)

In addition to the features on scale, delta, repeat and increase, data type is also an important factor that affects the encoding performance. For INT32 and INT64, similar values have smaller deltas than those of FLOAT and DOUBLE. Moreover, the longer INT64 and DOUBLE may have more 0 bits, where bit compacting strategies may perform well. Therefore, in the qualitative analysis in Table 4 and the quantitative evaluation such as Figure 9, the data types are also considered as important data features.

3 TEXT DATA FEATURES
Similarly, text time series data has several data features which may be related to encoding performance, including the distribution of values, the domain of values, the average length of text values and consecutive repeats of characters. Table 2 outlines the major text features.

Table 2: Text data features

3.1 Value
The text values often follow a Zipfian distribution [18, 45] in practice. The exponent of the Zipfian distribution represents the frequency of values. The larger the exponent is, the larger the skewness of value frequency is. Such skewness affects the performance of HUFFMAN encoding [29, 36], which relates to value frequency. Moreover, the domain size of text values is also important, e.g., to DICTIONARY encoding [44] that stores the value domain as a dictionary.

3.2 Character
The character features could affect the encoding algorithms that encode data at the character level. The length of values is of course a key factor. Longer values generally lead to larger character encoding results. Again, repeats of characters also affect the encoding performance, such as in RLE [27] and HUFFMAN [29, 36].

4 NUMERICAL DATA ENCODING
Referring to the aforesaid discussions on the lossless requirement and system architecture, we introduce six encoding algorithms that are proper to implement in Apache IoTDB, including TS_2DIFF [8], GORILLA [37], SPRINTZ [20], RLE [27], RLBE [41], and RAKE [21]. The source code of the implementation is available in the GitHub repository of Apache IoTDB [5]. Table 3 lists the common ideas that may be shared among different encoding algorithms. Specifically, a qualitative analysis of encoding effectiveness regarding various data features is presented in Table 4.

Table 3: Properties of numerical data encoding algorithms

4.1 Differential-based Encoding
Differential encoding proposes to reduce the absolute value when the data in a time series is continuous, especially when the original data is large. The number of significant bits reduces since the absolute value is decreased, which reduces storage costs. Thereby, the compression ratio has an important relationship with the delta features in differential encoding algorithms. While traditional differential encoding can only perform well on monotonous integer values, GORILLA and TS_2DIFF are two recent advances.

4.1.1 TS_2DIFF. The TS_2DIFF encoding is a variant of delta-of-delta [37]. It consists of three steps: delta encoding, second delta encoding, and bit-packing. The first step calculates the delta of every value by subtracting the previous value from the current one. Note that the first value does not have a previous value and should be stored directly. Then, the algorithm finds the minimum delta, mindiff, and gets the final data to store by subtracting mindiff from each delta. Finally, the leading zeros of the fixed-length binary data are removed to get the final encoded byte stream.

Figure 2: Examples of TS_2DIFF encoding algorithm
Table 4: A qualitative analysis of encoding effectiveness regarding various numerical data features

Figure 4: Examples of RLE encoding algorithm
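The run-length idea in Figure 4, where a run of a repeated value is stored as a repeat count plus the value, and the values are bit-packed under a max-bit-width header, can be sketched as below. The pair layout and single-width header are simplifications of the IoTDB RLE format [27]:

```python
def rle_encode(values):
    """Sketch of run-length encoding: collapse consecutive repeats into
    (repeat_count, value) pairs; the header stores the max bit width
    needed to pack the run values."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1               # extend the current run
        else:
            runs.append([1, v])            # start a new run
    width = max(1, max(v.bit_length() for _, v in runs))
    return width, [(c, v) for c, v in runs]
```

For the data in Figure 4, eight 3s followed by nine 5s collapse into the pairs (8, 3) and (9, 5), with a max bit width of 3.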
… data has more leading zeros, and will be compressed more efficiently than INT32 data, as summarized in Table 4.

Figure 5 shows a simple example of how values are encoded by the RAKE algorithm. Since the first 20 bits will obviously be encoded to five zeros, the process of compressing the first 20 bits is not shown. For the number in Figure 5, a sparse number N = [010000010000] is compressed by the RAKE algorithm (with T = 4) to produce a compressed sequence of 8 bits, [10101010].

Figure 5: Examples of RAKE encoding algorithm

4.3 Hybrid Encoding
While the differential-based and run-length-based encoding algorithms can perform well in different scenarios, there are certain cases with both small delta features and vast repeats. To this end, hybrid encoding with both ideas can achieve a better result, such as RLBE [41] and SPRINTZ [20].

4.3.1 RLBE. The RLBE encoding [41] proposes to combine delta, run-length and Fibonacci based encoding ideas. It has five steps: differential coding, binary encoding, run-length, Fibonacci coding [42] and concatenation. Specifically, delta encoding is first applied to the original data (integers of 32 bits), and the lengths of the differential values (in binary notation) are calculated. Run-length is then applied to the length codes. In the concatenation phase, the first 5 bits represent the length of binary words (the length is encoded in a binary word), followed by the Fibonacci code words of the repeat time of the length code, and sequentially followed by binary code words of differential values with the same length.

As summarized in Table 4, when the differential value is positive and small, RLBE performs well. RLBE performs badly when the differential value is negative, as the sign bit is '1' and no leading '0's can be abolished. When adjacent differential values are of different orders of magnitude, i.e., the variance is large, RLBE also performs badly, as run-length on the length code cannot be applied.

The examples are shown in Figure 6. Since the values are all increasing in the time series data and all the deltas are positive, RLBE has a good performance in the example in Figure 6.

Figure 6: Examples of RLBE encoding algorithm

To enlarge the encoding range, we extend the first 5 bits representing the binary code length to 6 bits, as shown in Figure 7. The reason is that when the differential value is negative, it has 32 meaningful bits, exceeding the representation range of 5 bits. Likewise, when supporting integers of 64 bits, we expand the length representing the binary code to 7 bits for the same reason.

Figure 7: Examples of INT32 and INT64 extensions for RLBE

4.3.2 SPRINTZ. The SPRINTZ encoding [20] combines encodings in four steps: predicting, bit-packing, run-length encoding and entropy encoding. In the first step, it uses a predictive function (delta encoding or Fast Integer REgression encoding) to estimate the next coming value. Then it encodes the difference of the actual value and the predicted one. Typically, this step shrinks the absolute value to be encoded. Next, it bit-packs a block of residuals obtained in the first step. The largest number of significant bits in the block is written in the header, and the leading zeros are trimmed. Following that, run-length encoding and entropy encoding (e.g., Huffman coding) are applied to reduce redundancy. Run-length coding compresses the consecutive zero blocks by recording the number of zeros, and entropy coding compresses the headers and payloads by encoding bytes in the form of Huffman coding.

As summarized in Table 4, the SPRINTZ algorithm is suitable for predictable time series. For the delta function, time series with vast repeats or linear increases are the best target. For the FIRE (Fast Integer REgression) predictor, a constant slope is the best fit.

Since the example in Figure 8 has small value variance and delta mean, the SPRINTZ encoding has a good performance.

Figure 8: Examples of SPRINTZ encoding algorithm
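The predict-then-pack steps of SPRINTZ can be sketched as follows. This is a simplified, single-block illustration using the delta predictor; the zigzag mapping below is the standard one, which differs slightly from the sign interleaving shown in Figure 8, and the run-length and Huffman stages are omitted:

```python
def zigzag(n):
    """Standard zigzag mapping: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * n if n >= 0 else -2 * n - 1

def sprintz_block(values, prev):
    """Sketch of SPRINTZ's predict + bit-pack steps on one block with the
    delta predictor: residual = value - previous, zigzag-mapped to be
    non-negative; the block header stores the max bit width."""
    residuals = []
    for v in values:
        residuals.append(zigzag(v - prev))
        prev = v
    width = max(r.bit_length() for r in residuals)
    return width, residuals
```

For Figure 8's series (first value 2, then 4, 6, 7, 6, 8, 7, 8), the deltas 2, 2, 1, −1, 2, −1, 1 map to the non-negative residuals 4, 4, 2, 1, 4, 1, 2, packable in 3 bits each.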
Table 5: A qualitative analysis of encoding effectiveness regarding various text data features

Table 6: Parameters of numerical data generator for various data features
Algorithm 1: Numerical data generator
Data: μv, μd, σd, γ, η, length n
Result: TS
1   DS := empty_list();
2   while |DS| < n do
3       isRepeat := random_index(γ);
        /* Probability of isRepeat == 1 is γ, probability of isRepeat == 0 is 1 − γ */
4       if isRepeat then
5           repeat_len := random(8, T);
            /* Get a random number in (8, T] */
6           DS.append(0, repeat_len);
            /* Append 0 for repeat_len times */
7       else
8           isPositive := random_index(η);
9           delta := 0;
10          if isPositive then
11              while delta ≤ 0 do
12                  delta := random_gauss(μd, σd);
13              end
14          else
15              while delta ≥ 0 do
16                  delta := random_gauss(μd, σd);
17              end
18          end
19          DS.append(delta);
20      end
21  end
22  TS := prefix_sum(DS);
23  TS.zoom(μv);
    /* Zoom TS to adjust the mean to μv */
24  return TS;

Table 7: Real-world numerical datasets

Dataset       Public  # data points  # time series
MSRC-12       [10]    17,059         10
UCI-Gas       [11]    189,981        19
WC-Vehicle            79,992         8
TH-Climate            1,317,330      140
CW-AIOps              2,215,599      224
CS-Ship               89,991         9
TY-Carriage           9,680,088      450
WH-Chemistry          44,622         54
CR-Train              859,914        86
CB-Engine             533,901        88

Table 8: Parameters of text data generator for various text data features

Notation  Text Data Feature             Range
θv        Value of exponent             [0, 10]
Nv        Domain size of text values    [1, 1500]
ℓc        Average length of text value  [100, 1100]
γc        Repeat rate                   [0.9, 1]

… due to the complex application scenarios. CS-Ship monitors the status of ship engines. Value mean and delta mean are small while increase is high. TY-Carriage contains readings of carriage monitoring sensors, and it has low delta mean. WH-Chemistry is a dataset from a chemical plant. It has high value mean, value variance, value spread, delta mean, delta variance and delta spread. CR-Train is a dataset from a metro system, with low delta mean and high repeat rate. CB-Engine consists of sensor readings in a concrete mixer. It has low delta mean, delta variance and repeat rate.

6.3 Synthetic Text Data
The text data generator considers 4 parameters in Table 8, corresponding to the 4 features listed in Table 2 in Section 3. The exponent θv determines the skewness of the value distribution. Nv is the domain size of text values. These two parameters decide the distribution of values. Moreover, ℓc represents the average length of all the text values, and the repeat rate γc is the probability of generating consecutive character repeats.

Algorithm 2 shows the pseudo-code of the text data generator. Let TS be the time series and TD be the value domain to generate TS. Lines 2-16 generate TD with domain size Nv, given the value length ℓc and character repeat rate γc. Then, lines 17-23 generate the distribution of values, under a Zipfian distribution with exponent θv.

6.4 Real-world Text Data
Table 9 presents several real-world text time series datasets. CW-AIOps is a log dataset of application performance monitoring (APM) in cloud services, collected by our industrial partners. Web Server Access Logs, Incident Event Log Dataset and Web Log Dataset are public datasets on Kaggle [15]. Among them, Web Server Access Logs contain information on any event that was registered / logged. Incident Event Log Dataset is an event log dataset of a website. Web Log Dataset is the server log dataset of RUET OJ.

6.5 Evaluation Metrics
To measure the performance of encoding algorithms for time series, two aspects are considered: compression ratio in space cost, and encoding and decoding time cost.

6.5.1 Compression Ratio. It measures the ratio of compressed (encoded) data size to uncompressed (non-encoded) data size:

    compressionRatio = compressedSize / uncompressedSize.

6.5.2 Time Cost. We report two aspects of time cost. The insert time measures the total cost of inserting a time series, including adding to the memTable, flushing from memory to disk with sorting, encoding, and compressing. The select time measures the total cost of querying a time series, with decompressing and decoding.
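Algorithm 1, the numerical data generator above, can be sketched in Python roughly as follows. This is a simplified rendering with assumptions: random_index and random are mapped to stdlib calls, the run-length bound T is taken as a parameter, and the "zoom" step is read as shifting the mean to μv:

```python
import random

def generate_numerical(mu_v, mu_d, sigma_d, gamma, eta, n, T=16):
    """Sketch of Algorithm 1: build a delta series DS, prefix-sum it
    into TS, then adjust the mean of TS to mu_v."""
    ds = []
    while len(ds) < n:
        if random.random() < gamma:                   # isRepeat with prob. gamma
            ds.extend([0.0] * random.randint(9, T))   # a run length in (8, T]
        else:
            sign = 1 if random.random() < eta else -1 # isPositive with prob. eta
            delta = 0.0
            while sign * delta <= 0:                  # resample until sign matches
                delta = random.gauss(mu_d, sigma_d)
            ds.append(delta)
    ts, acc = [], 0.0
    for d in ds[:n]:
        acc += d
        ts.append(acc)                                # prefix_sum(DS)
    shift = mu_v - sum(ts) / n                        # one reading of 'zoom'
    return [v + shift for v in ts]
```

The generated series then has mean μv, with repeat runs controlled by γ and the share of increases controlled by η.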
Algorithm 2: Text data generator
Data: θv, Nv, ℓc, γc, length n
Result: TS
1   TS := empty_list()
2   TD := new string[Nv]
3   for i ∈ [0, Nv] do
4       for j ∈ [0, ℓc] do
5           if j = 0 then
6               TD_i.append(rand_Char())
7               continue
8           end
9           isRepeat := random_index(γc)
            /* Probability of isRepeat == 1 is γc, probability of isRepeat == 0 is 1 − γc */
10          if isRepeat then
11              TD_i.append(TD_i.back())
12          else
13              TD_i.append(rand_Char_except(TD_i.back()))
14          end
15      end
16  end
17  num := {num_0, ..., num_{Nv−1}}, where num_i = ((1/(i+1))^θv / Σ_{j=0..Nv−1} (1/(j+1))^θv) · n
18  for i ∈ [0, Nv] do
19      for j ∈ [0, num_i] do
20          TS.append(TD_i)
21      end
22  end
23  TS := random_permutation(TS)
24  return TS

Table 9: Real-world text datasets

Dataset                 Public  # points    # series
CW-AIOps                        2,215,599   224
Web Server Access Logs  [12]    10,365,152  1
Incident Event Log      [13]    141,712     35
Web Log                 [14]    258,441     4

Figure 9: Compression ratio over all numerical datasets

7 EXPERIMENTAL EVALUATION
In this section, we conduct the experimental evaluation of the encoding algorithms analyzed in Section 4. Note that encoding and compressing are complementary. Therefore, we study PLAIN (no encoding), TS_2DIFF [8], GORILLA [37], SPRINTZ [20], RLE [27], RLBE [41], and RAKE [21] encoding, combined with the NONE (no compression), SNAPPY [38], GZIP [16] or LZ4 [19] compressor. We present their performances over various data types including INT32, INT64, FLOAT and DOUBLE as listed in Table 4. With the help of the benchmark in Section 6, the results on both the real-world and synthetic data with various features are reported.

The experiments run on a machine with a 2.1 GHz Intel(R) Xeon(R) CPU and 128 GB RAM. While the source code of the encoding algorithms has been deployed in the GitHub repository of Apache IoTDB [5], the experiment related code and data are available in [6].

7.1 Real-world Numerical Data Evaluation
Figures 9, 10 and 11 report the compression ratio, insert time and select time defined in Section 6.5, respectively. The experiments are conducted over all 1088 time series in Table 7. In the boxplot charts, every element represents one time series. We study 28 combinations of 7 encoding schemes and 4 compression schemes on 4 different data types. In Figure 9, the lower the metrics are, the better the performance is. Since the compression ratios are the same in different runs, we do not repeat the experiments of compression ratio. The experiments on time cost are repeated 50 times.

7.1.1 Comparison. As illustrated in Figure 9, TS_2DIFF encoding achieves a good (low) compression ratio, with or without compression. RAKE encoding performs even worse than PLAIN (no encoding) when handling INT32 and FLOAT.

As introduced in Section 4.2.2, when there are many consecutive 0 bits, RAKE performs well. The more 1 bits there are, the worse the RAKE algorithm performs. For INT32 and FLOAT, since there are fewer 0 bits than in INT64 and DOUBLE for the same values, the performance of RAKE is worse. In particular, for negative numbers with a small absolute value, owing to the leading sign bit '1' and more leading '1's, RAKE may perform even worse given the extra cost of setting bits and so on. Similar results are also observed in Figure 14(a), where RAKE again performs worse than PLAIN on negative numbers (with value mean less than 0).

GORILLA performs better on INT32 and INT64 than on FLOAT and DOUBLE since the positions of leading and trailing 0s are more similar. The results verify the qualitative analysis on data types in Table 4.
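The Zipfian allocation in line 17 of Algorithm 2 can be written out explicitly. The sketch below computes how many of the n generated values fall on each of the Nv domain values (an illustration of the formula, not the generator's actual code):

```python
def zipf_counts(n, N_v, theta_v):
    """Line 17 of Algorithm 2: the i-th most frequent domain value gets
    num_i = ((1/(i+1))**theta_v / sum_j (1/(j+1))**theta_v) * n
    occurrences, i.e. a Zipf-weighted share of the n values."""
    weights = [(1.0 / (i + 1)) ** theta_v for i in range(N_v)]
    total = sum(weights)
    return [w / total * n for w in weights]
```

With exponent 0 the allocation is uniform; as θv grows, the counts skew toward the first domain values, which is the skewness that HUFFMAN encoding exploits.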
Figure 10: Insert time over all numerical datasets

Figure 11: Select time over all numerical datasets

Nevertheless, with compression, the space cost is further reduced. The improvement by compression after TS_2DIFF encoding, however, is limited, with extra compressing and decompressing time.

Figure 12: Compression ratio and features on each dataset

7.1.2 Individual Datasets. In addition to the qualitative analysis on data types, we further validate the effects of the data features analyzed in Table 4. Figure 12(a) reports the compression ratio of the 7 encoding schemes without compression applied (NONE). Figure 12(b) shows the corresponding 8 data features listed in Table 1. In general, TS_2DIFF encoding still achieves good performance, while RAKE could be worse than PLAIN without encoding, similar to the observations in Figure 11. For the datasets with large delta mean, such as UCI-Gas, TH-Climate, MSRC-12, CS-Ship and TY-Carriage, TS_2DIFF still performs well. For the datasets with small value variance and delta mean, such as WC-Vehicle and CB-Engine, GORILLA performs better. RLBE performs better on CS-Ship than on other datasets, since CS-Ship has relatively smaller delta mean and delta variance. The results verify again the analysis in Table 4.

Note that both time and value series are encoded and compressed, and the statistics stored in the PageHeader consider both time and value series compression. Since time is encoded and compressed …
… compression on value series together with time is less than 1.

7.1.3 Trade-off. Figure 13 breaks down the time costs into encoding time (ET), decoding time (DT), compression time (CT), uncompression time (UT) and the corresponding compression ratio (CR) of the different encoding algorithms together with four compression strategies. The experiments run on all the real-world datasets in Table 7 and report the average. For each dimension, we normalize the results into a range between 0 and 1, the larger the better. For ET, DT, CT and UT, a larger value represents a lower time compared with other encoding algorithms, i.e., more efficient. For compression ratio (CR), a larger metric represents a lower compression ratio, …

As shown, most encoding algorithms are efficient in encoding (ET). TS_2DIFF has a better compression ratio (CR) as well as compression time (CT) and uncompression time (UT), but the corresponding decoding time (DT) is worse. GORILLA, with both better encoding and decoding time (ET and DT), has a worse compression ratio (CR). Similar trade-off results are also observed in Figure 13(d) without compression (NONE).

Figure 16: Varying delta feature of delta variance σd

Figure 17: Varying repeat feature of repeat rate γ
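The per-dimension normalization used for Figure 13 (scores in [0, 1], larger meaning better, so time and ratio axes are inverted) could be realized, for instance, as inverted min-max scaling; the exact scaling used in the paper's chart is an assumption here:

```python
def normalize_lower_is_better(costs):
    """Map raw costs (time or compression ratio; lower is better) to
    [0, 1] scores where larger means better, via inverted min-max
    scaling over the compared encoding algorithms."""
    lo, hi = min(costs), max(costs)
    if hi == lo:
        return [1.0 for _ in costs]    # all algorithms tie on this dimension
    return [(hi - c) / (hi - lo) for c in costs]
```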
TS_2DIFF RAKE RLBE PLAIN HUFFMAN DICTIONARY RLE PLAIN
GORILLA RLE SPRINTZ (a) Compression Ratio (b) Insert Time (c) Select Time
(a) Increase (b) Increase (c) Increase 4 0.8
0.012 2.5
Compression Ratio
0.15
0.6 0.010 2.0
Compression Ratio
Select Time
Insert Time
0.6
Select Time(s)
Insert Time(s)
0.008 1.5
0.10 2
0.4 0.006 1.0
0.4
0.004 1
0.05 0.5
0.2 0.002
0 0.0
NONESNAPPY LZ4 GZIP NONESNAPPY LZ4 GZIP NONESNAPPY LZ4 GZIP
0.00 0.000
0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 Compression Compression Compression
Increase Increase Increase
Compression Ratio
1.00 0.8
peat times and differential value instead of storing lots of large val- 3
Select Time(s)
Insert Time(s)
ues. GORILLA is unstable, as its performance is affected by XOR, 0.75 0.6
2
unrelated to value mean. 0.50 0.4
For delta features, Figure 15(a) varies delta mean 𝜇𝑑 and Figure 0.25 1
0.2
16(a) considers delta variance 𝜎𝑑 . TS_2DIFF performance decreases
0.00
with the increase of delta variance in Figure 16(a), which is not 0 5 10
0.0
0 5 10
0
0 5 10
surprising referring to the analysis in Table 4. Exponent Exponent Exponent
Figure 18(a) varies the increase rate 𝜂. When the increase rate becomes larger, more positive values appear, which is beneficial to bit-packing. Thereby, RLBE is positively correlated with the increase rate.

7.2.2 Time Cost. Analogous to the compression ratio results in Section 7.2.1, Figures 14(b), 15(b), 16(b), 17(b) and 18(b) report the insert time under various value mean 𝜇𝑣, delta mean 𝜇𝑑, delta variance 𝜎𝑑, repeat rate 𝛾 and increase rate 𝜂, respectively. Likewise, Figures 14(c)-18(c) report the corresponding select time. Due to the limited space, similar results on other setting combinations are omitted. Each test is conducted 50 times and the average is reported.

As shown, while the compression ratio is largely affected by the various data features in Figures 14(a)-18(a), the corresponding insert time and select time are stable. Different encoding methods indeed lead to very close insert and select times. Similar to Figures 10 and 11, the insert time is much higher than the select time under the various data features. Owing to the extremely low select time, the variances of the select time in these tests are relatively large. The results are generally consistent with those on the real datasets in Section 7.1.

7.3 Real-world Text Data Evaluation

Figure 19 reports the compression ratio, insert time and select time defined in Section 6.5, respectively, over all text datasets in Table 9. We perform the 16 combinations of 4 text encoding schemes and 4 general compression schemes on the text data.

As shown in Figure 19, when no compression algorithm is applied, HUFFMAN achieves the best compression ratio but the worst time cost. RLE shows an even worse compression ratio than PLAIN (no encoding), owing to the limited repeated characters, as analyzed in Section 5.2. When compression is applied, whether SNAPPY, LZ4 or GZIP, the DICTIONARY encoding achieves almost the best compression ratio and time cost.

7.4 Varying Text Data Features

Similar to the numeric data, we generate text data with the different features mentioned in Section 3 by Algorithm 2, and evaluate the encoding performance on the synthetic text data.

For value features, Figure 20 varies the exponent 𝜃𝑣 introduced in Table 8. The larger the exponent, i.e., the more skewed the data distribution, the better HUFFMAN performs. The improvement in compression ratio, however, is not significant. The other algorithms are not affected by the exponent, as also analyzed in Table 5.

[Figure 21: Varying text feature of domain size 𝑁𝑣. Panels (a)-(c): compression ratio, insert time (s) and select time (s) over domain sizes 0-1000.]

Figure 21 varies the domain size 𝑁𝑣 introduced in Table 8. It is not surprising that DICTIONARY performs worse with the increase of the domain size. In contrast, DICTIONARY favors a larger value length, as illustrated in Figure 22, with a slight improvement.
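Similarly, HUFFMAN's advantage on skewed character distributions can be checked with a small sketch. The Zipf-style generator `zipf_text` and the one-byte-per-character PLAIN baseline below are illustrative assumptions, not the paper's Algorithm 2, and the sketch ignores the cost of storing the Huffman tree itself:

```python
import heapq
import random
from collections import Counter

def huffman_bits(text):
    """Total encoded length in bits under an optimal Huffman code,
    using the classic trick: the sum of all merged node weights
    equals sum(freq[c] * codelength[c])."""
    counts = list(Counter(text).values())
    if len(counts) == 1:
        return len(text)                 # degenerate one-symbol alphabet
    heapq.heapify(counts)
    total = 0
    while len(counts) > 1:
        a = heapq.heappop(counts)
        b = heapq.heappop(counts)
        total += a + b                   # symbols below this merge gain one bit
        heapq.heappush(counts, a + b)
    return total

random.seed(1)
alphabet = [chr(ord("a") + i) for i in range(26)]

def zipf_text(theta, n=20_000):
    # character frequency proportional to 1 / rank^theta (illustrative)
    weights = [1 / (rank + 1) ** theta for rank in range(len(alphabet))]
    return "".join(random.choices(alphabet, weights=weights, k=n))

ratios = []
for theta in (0.0, 1.0, 2.0):            # larger exponent => more skewed
    text = zipf_text(theta)
    # compare against PLAIN at one byte per character
    ratios.append(huffman_bits(text) / (8 * len(text)))
print([round(r, 3) for r in ratios])
```

As the exponent grows, the average code length drops toward the (lower) entropy of the skewed distribution, matching the qualitative behavior of HUFFMAN in Figure 20.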
In Figure 23, the compression ratio of RLE significantly improves when the character repeat rate 𝛾𝑐 is large, as analyzed in Table 5 and Section 5.2. However, such character repeats may not be prevalent in practice, as illustrated in Figure 19(a).

With more characters to process, the insert time increases significantly as the value length grows, while it is almost unchanged when the exponent, domain size and repeat rate vary.

Since the HUFFMAN algorithm needs to recover the Huffman tree in the process of selecting data, it has a significantly higher select time. When the repeat rate becomes larger, the Huffman tree becomes smaller, and the select time decreases in Figure 23(c). In contrast, with the increase of the value length, the Huffman tree becomes larger, and the select time increases in Figure 22(c).

[Figure 22: Varying text feature of value length ℓ𝑐. Panels (a)-(c): compression ratio, insert time (s) and select time (s) over lengths 500-1000, for HUFFMAN, DICTIONARY, RLE and PLAIN.]

[Figure 23: Varying text feature of repeat rate 𝛾𝑐. Panels (a)-(c): compression ratio, insert time (s) and select time (s) over repeat rates 0.90-1.00, for HUFFMAN, DICTIONARY, RLE and PLAIN.]

8 RELATED WORK

While this study focuses on encoding methods that are practical to implement in time series database management systems, there exist many other alternatives (see [22] for a survey).

8.1 Lossless Encoding

In addition to the lossless encoding algorithms studied in this paper, the dictionary-based algorithms [32, 34] are not practical to implement for numerical values, since the dictionaries could be too large to store in the PageHeader of Apache IoTDB. Similarly, machine learning based lossless encoding, such as [47] consisting of a transform stage and an encoding stage, needs to conduct reinforcement learning. Not only are the models too large to store, but the learning is also too heavy to process inside databases.

8.2 Lossy Encoding

While we focus on the lossless encoding required in databases, lossy encoding is also practical in other applications, especially on end or edge devices. Plato [31] proposes to reduce noise. ODH [30] adopts different lossy algorithms to encode linear and non-linear data, respectively. Eichinger et al. [25] propose to estimate time series by piecewise polynomials, applying a greedy method and three different online regression algorithms, including the PMR-Midrange algorithm [33], an optimal approximation algorithm [24], and a randomized algorithm [39], for approximating constant functions, straight lines, and polynomials. Fink and Gandhi [26] introduce an algorithm to encode time series by exploiting time series maxima and minima. TRISTAN [34] and CORAD [32] use autocorrelation in one or multiple time series to improve compression ratio and accuracy.

8.3 General Compression

In Apache IoTDB, a compression step for general data is applied after the time series is encoded, i.e., the two are complementary. The compression algorithms implemented in Apache IoTDB, GZIP [16], SNAPPY [38] and LZ4 [19], all originate from LZ77 [48], looking for the longest matching string using a sliding window over the input stream. Nevertheless, the results in Figure 9 show that TS_2DIFF encoding is already efficient, and further applying general purpose compression cannot reduce the space cost.

9 CONCLUSION

In this paper, we provide both qualitative and quantitative analyses of time series encoding algorithms with regard to various data features. The comparison is conducted in Apache IoTDB, an open-source time series database developed in our preliminary study [43]. First, we profile several features that may affect the performance of encoding. The qualitative and quantitative analyses are then built on these data features. To evaluate the encoding algorithms, we present a benchmark with real-world data and a data generator for various features.

We notice that different encoding algorithms favor different data features. This motivates us to recommend distinct encoding algorithms for different datasets, referring to their features. While some preliminary results of the encoding recommender are presented in Appendix A of [17], the recommender is expected to be improved further. For instance, one may employ more advanced machine learning models to train a more accurate recommender. Incremental and transfer learning could also be applied, to address evolving data features and to generalize over never-seen datasets. In addition to pursuing more concise encoding, one may also expect to balance the space cost against the time cost of efficient query processing.

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China (62072265, 62021002), the National Key Research and Development Plan (2021YFB3300500, 2019YFB1705301, 2019YFB1707001), Beijing National Research Center for Information Science and Technology (BNR2022RC01011), and Alibaba Group through the Alibaba Innovative Research (AIR) Program. Shaoxu Song (https://sxsong.github.io/) is the corresponding author.
REFERENCES
[1] https://iotdb.apache.org/.
[2] https://www.influxdata.com/.
[3] http://opentsdb.net/.
[4] https://prometheus.io/.
[5] https://github.com/apache/iotdb/tree/research/encoding-exp.
[6] https://github.com/xjz17/iotdb/tree/TSEncoding.
[7] https://thulab.github.io/iotdb-quality/.
[8] https://iotdb.apache.org/UserGuide/Master/Data-Concept/Encoding.html.
[9] https://github.com/thulab/iotdb-benchmark.
[10] https://www.microsoft.com/en-us/download/details.aspx.
[11] https://archive.ics.uci.edu.
[12] https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs.
[13] https://www.kaggle.com/datasets/winmedals/incident-event-log-dataset.
[14] https://www.kaggle.com/datasets/shawon10/web-log-dataset.
[15] https://www.kaggle.com/datasets/.
[16] https://www.gnu.org/software/gzip/.
[17] https://sxsong.github.io/doc/encoding.pdf.
[18] Anders Aamand, Piotr Indyk, and Ali Vakilian. (learned) frequency estimation algorithms under zipfian distribution. CoRR, abs/1908.05198, 2019.
[19] Matej Bartik, Sven Ubik, and Pavel Kubalík. LZ4 compression algorithm on FPGA. In 2015 IEEE International Conference on Electronics, Circuits, and Systems, ICECS 2015, Cairo, Egypt, December 6-9, 2015, pages 179-182. IEEE, 2015.
[20] Davis W. Blalock, Samuel Madden, and John V. Guttag. Sprintz: Time series compression for the internet of things. CoRR, abs/1808.02515, 2018.
[21] Giuseppe Campobello, Antonino Segreto, Sarah Zanafi, and Salvatore Serrano. RAKE: A simple and efficient lossless compression algorithm for the internet of things. In 25th European Signal Processing Conference, EUSIPCO 2017, Kos, Greece, August 28 - September 2, 2017, pages 2581-2585. IEEE, 2017.
[22] Giacomo Chiarot and Claudio Silvestri. Time series compression: a survey. CoRR, abs/2101.08784, 2021.
[23] E. F. Codd. Relational database: A practical foundation for productivity. Commun. ACM, 25(2):109-117, 1982.
[24] Marco Dalai and Riccardo Leonardi. Approximations of one-dimensional digital signals under the l∞ norm. IEEE Trans. Signal Process., 54(8):3111-3124, 2006.
[25] Frank Eichinger, Pavel Efros, Stamatis Karnouskos, and Klemens Böhm. A time-series compression technique and its application to the smart grid. VLDB J., 24(2):193-218, 2015.
[26] Eugene Fink and Harith Suman Gandhi. Compression of time series by extracting major extrema. J. Exp. Theor. Artif. Intell., 23(2):255-270, 2011.
[27] Solomon W. Golomb. Run-length encodings (corresp.). IEEE Trans. Inf. Theory, 12(3):399-401, 1966.
[28] Muon Ha and Yulia A. Shichkina. Translating a distributed relational database to a document database. Data Sci. Eng., 7(2):136-155, 2022.
[29] Paul G. Howard and Jeffrey Scott Vitter. Parallel lossless image compression using huffman and arithmetic coding. Inf. Process. Lett., 59(2):65-73, 1996.
[30] Sheng Huang, Yaoliang Chen, Xiaoyan Chen, Kai Liu, Xiaomin Xu, Chen Wang, Kevin Brown, and Inge Halilovic. The next generation operational data historian for iot based on informix. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 169-176. ACM, 2014.
[31] Yannis Katsis, Yoav Freund, and Yannis Papakonstantinou. Combining databases and signal processing in plato. In Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings. www.cidrdb.org, 2015.
[32] Abdelouahab Khelifati, Mourad Khayati, and Philippe Cudré-Mauroux. CORAD: correlation-aware compression of massive time series using sparse dictionary coding. In Chaitanya Baru, Jun Huan, Latifur Khan, Xiaohua Hu, Ronay Ak, Yuanyuan Tian, Roger S. Barga, Carlo Zaniolo, Kisung Lee, and Yanfang Fanny Ye, editors, 2019 IEEE International Conference on Big Data (IEEE BigData), Los Angeles, CA, USA, December 9-12, 2019, pages 2289-2298. IEEE, 2019.
[33] Iosif Lazaridis and Sharad Mehrotra. Capturing sensor-generated time series with quality guarantees. In Umeshwar Dayal, Krithi Ramamritham, and T. M. Vijayaraman, editors, Proceedings of the 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India, pages 429-440. IEEE Computer Society, 2003.
[34] Alice Marascu, Pascal Pompey, Eric Bouillet, Michael Wurst, Olivier Verscheure, Martin Grund, and Philippe Cudré-Mauroux. TRISTAN: real-time analytics on massive time series using sparse dictionary compression. In Jimmy Lin, Jian Pei, Xiaohua Hu, Wo Chang, Raghunath Nambiar, Charu C. Aggarwal, Nick Cercone, Vasant G. Honavar, Jun Huan, Bamshad Mobasher, and Saumyadipta Pyne, editors, 2014 IEEE International Conference on Big Data (IEEE BigData 2014), Washington, DC, USA, October 27-30, 2014, pages 291-300. IEEE Computer Society, 2014.
[35] V. Krishna Nandivada and Rajkishore Barik. Improved bitwidth-aware variable packing. ACM Trans. Archit. Code Optim., 10(3):16:1-16:22, 2013.
[36] Ghim Hwee Ong and Shell-Ying Huang. A data compression scheme for chinese text files using huffman coding and a two-level dictionary. Inf. Sci., 84(1&2):85-99, 1995.
[37] Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin Teller, and Kaushik Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proc. VLDB Endow., 8(12):1816-1827, 2015.
[38] Horst Samulowitz, Chandra Reddy, Ashish Sabharwal, and Meinolf Sellmann. Snappy: A simple algorithm portfolio. In Matti Järvisalo and Allen Van Gelder, editors, Theory and Applications of Satisfiability Testing - SAT 2013 - 16th International Conference, Helsinki, Finland, July 8-12, 2013. Proceedings, volume 7962 of Lecture Notes in Computer Science, pages 422-428. Springer, 2013.
[39] Raimund Seidel. Small-dimensional linear programming and convex hulls made easy. Discret. Comput. Geom., 6:423-434, 1991.
[40] Bin Song, Limin Xiao, Guangjun Qin, Li Ruan, and Shida Qiu. A deduplication algorithm based on data similarity and delta encoding. In Hanning Yuan, Jing Geng, and Fuling Bian, editors, Geo-Spatial Knowledge and Intelligence - 4th International Conference on Geo-Informatics in Resource Management and Sustainable Ecosystem, GRMSE 2016, Hong Kong, China, November 18-20, 2016, Revised Selected Papers, Part II, volume 699 of Communications in Computer and Information Science, pages 245-253. Springer, 2016.
[41] Julien Spiegel, Patrice Wira, and Gilles Hermann. A comparative experimental study of lossless compression algorithms for enhancing energy efficiency in smart meters. In 16th IEEE International Conference on Industrial Informatics, INDIN 2018, Porto, Portugal, July 18-20, 2018, pages 447-452. IEEE, 2018.
[42] Jirí Walder, Michal Krátký, and Jan Platos. Fast fibonacci encoding algorithm. In Jaroslav Pokorný, Václav Snásel, and Karel Richta, editors, Proceedings of the Dateso 2010 Annual International Workshop on DAtabases, TExts, Specifications and Objects, Stedronin-Plazy, Czech Republic, April 21-23, 2010, volume 567 of CEUR Workshop Proceedings, pages 72-83. CEUR-WS.org, 2010.
[43] Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin Mcgrail, Peng Wang, Diaohan Luo, Jun Yuan, Jianmin Wang, and Jiaguang Sun. Apache iotdb: Time-series database for internet of things. Proc. VLDB Endow., 13(12):2901-2904, 2020.
[44] Terry A. Welch. A technique for high-performance data compression. Computer, 17(6):8-19, 1984.
[45] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. Mining top-k itemsets over a sliding window based on zipfian distribution. In Hillol Kargupta, Jaideep Srivastava, Chandrika Kamath, and Arnold Goodman, editors, Proceedings of the 2005 SIAM International Conference on Data Mining, SDM 2005, Newport Beach, CA, USA, April 21-23, 2005, pages 516-520. SIAM, 2005.
[46] Retaj Yousri, Madyan Alsenwi, M. Saeed Darweesh, and Tawfik Ismail. A design for an efficient hybrid compression system for eeg data. In 2021 International Conference on Electronic Engineering (ICEEM), pages 1-6, 2021.
[47] Xinyang Yu, Yanqing Peng, Feifei Li, Sheng Wang, Xiaowei Shen, Huijun Mai, and Yue Xie. Two-level data compression using machine learning in time series database. In 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, April 20-24, 2020, pages 1333-1344. IEEE, 2020.
[48] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977.