
Time Series Data Encoding for Efficient Storage: A Comparative Analysis in Apache IoTDB


Jinzhao Xiao, Yuxiang Huang, Changyu Hu
BNRist, Tsinghua University
[email protected], [email protected], [email protected]

Shaoxu Song, Xiangdong Huang, Jianmin Wang
BNRist, Tsinghua University
[email protected], [email protected], [email protected]

ABSTRACT

Not only the vast applications but also the distinct features of time series data stimulate the booming growth of time series database management systems, such as Apache IoTDB, InfluxDB, OpenTSDB and so on. Almost all these systems employ columnar storage, with effective encoding of time series data. Given the distinct features of various time series data, it is not surprising that different encoding strategies may perform differently. In this study, we first summarize the features of time series data that may affect encoding performance, including scale, delta, repeat and increase. Then, we introduce the storage scheme of a typical time series database, Apache IoTDB, prescribing the limits to implementing encoding algorithms in the system. A qualitative analysis of encoding effectiveness with regard to various data features is then presented for the studied algorithms. To this end, we develop a benchmark for evaluating encoding algorithms, including a data generator covering the aforesaid data features and several real-world datasets from our industrial partners. Finally, we present an extensive experimental evaluation using the benchmark. Remarkably, a quantitative analysis of encoding effectiveness with regard to various data features is conducted in Apache IoTDB.

[Figure 1: Example of real data with distinct features on (a) large scale, (b) large delta, (c) vast repeats and (d) vast increases, affecting the encoding performance]
PVLDB Reference Format:
Jinzhao Xiao, Yuxiang Huang, Changyu Hu, Shaoxu Song, Xiangdong Huang, and Jianmin Wang. Time Series Data Encoding for Efficient Storage: A Comparative Analysis in Apache IoTDB. PVLDB, 15(10): 2148-2160, 2022.
doi:10.14778/3547305.3547319

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/xjz17/iotdb/tree/TSEncoding.

1 INTRODUCTION

The time ordered values, write intensive workloads and other special features make the management of time series data distinct from relational databases [23, 28], and thus lead to the development of time series database management systems, open source or commercial, such as Apache IoTDB [1], InfluxDB [2], OpenTSDB [3], Prometheus [4] and so on. It is not surprising that almost all these systems employ columnar storage, given that time series are naturally organized in two columns, time and value. In particular, the column-oriented scheme enables effective encoding and compression of time series data. Obviously, the distinct features of various data as illustrated in Figure 1 lead to different encoding performances.

While general purpose data compression methods can be directly applied, e.g., SNAPPY [38] and LZ4 [19], the encoding techniques are often specialized for time series, under some intuitions like values usually not changing significantly over time, i.e., small delta. Though lossy approaches like expressing time series in piecewise polynomial [25] are highly efficient in reducing space and useful in edge or end devices, as a database, industrial customers expect a complete archive of all the digital assets, i.e., lossless. Moreover, the extremely intensive write workloads, often machine generated in IoT scenarios, prevent time consuming approaches such as machine learning based reinforcement learning [47]. In this sense, the scope of this study is within lossless encoding with efficient system implementation.

In this paper, we present a comparative analysis of time series data encoding techniques in Apache IoTDB, an open-source time series database developed in our preliminary studies [43]. Our major contributions are summarized as follows.

(1) We summarize several time series data features that may affect the performance of encoding in Section 2. Intuitively, as illustrated in Figure 1, the scale of values is obviously an important factor of storage. Likewise, when storing the delta between two consecutive values, it becomes a key issue. The numbers of value repeats and increases are also essential to some encoding ideas.

(2) We present a qualitative analysis of encoding effectiveness with regard to various data features in Section 4. While there is no winner on all the data features, TS_2DIFF performs well in a number of cases. For the cases where TS_2DIFF shows worse results, such as repeat rate 1 in Figure 17, it may be less frequent and not that significant in practice.

(3) We devise a benchmark for time series data encoding. It consists of (a) data generators for simulating various data features, (b) several real-world datasets, public or collected by our industrial partners, and (c) metrics such as compression ratio (space cost after encoding and compressing divided by the original space cost). In particular, multiple features could vary at the same time in the generator, such as large values but small deltas, so that the distinct cases favored by different algorithms could be illustrated. Moreover, different data types are supported, including INT32, INT64, FLOAT and DOUBLE for numerical values, as well as text values.

(4) We conduct an extensive experimental evaluation in Section 7. The quantitative analysis generally verifies the aforesaid qualitative analysis of encoding performance with regard to various data features.

Finally, we also discuss some related work in Section 8, and outline some future directions in Section 9 referring to the analysis. The source code of the encoding algorithms has been deployed in the GitHub repository of Apache IoTDB [5]. The experiment related code and data are available in [6].

2 NUMERICAL DATA FEATURES

We select several features that may affect the performance of time series encoding. As illustrated in Section 4, we have three types of lossless encoding algorithms: RLE-based, Diff-based and hybrid. For all the algorithms, scale is an important feature, since techniques like bit-packing are widely used. Diff-based algorithms favor small changes in consecutive values, and thus the delta feature is considered. RLE-based algorithms handle repeating contents in consecutive values, i.e., the repeat feature. Finally, we notice that the sign bit of a negative delta also affects the performance of bit compression, leading to the increase feature of consecutive values.

In addition to the aforesaid four features, there do exist others for consideration. For instance, the signal-to-noise ratio (SNR) could be considered in frequency-domain-based compression [46], which however is lossy and out of the scope of this study. Moreover, besides the numerical values, we further introduce two other features on value and character for the text data type in Section 3.

Table 1 outlines the major features. For simplicity, we use TS = [v1, v2, ..., vn] to denote the value list of a time series. Most of these features can be directly calculated in Apache IoTDB by the data profiling tools developed in our previous study [7].

Table 1: Numerical data features

Category | Notation   | Feature
Scale    | Mean(TS)   | Mean of values
Scale    | Var(TS)    | Variance of values
Scale    | Spread(TS) | Maximum minus minimum of values
Delta    | Mean(DS)   | Mean of deltas
Delta    | Var(DS)    | Variance of deltas
Delta    | Spread(DS) | Maximum minus minimum of deltas
Repeat   | Count(RS)  | Count of consecutive repeats
Increase | Count(IS)  | Count of increases

2.1 Scale

The scale of data is one of the most important factors in storage. In general, the larger the values are, the more bits we need to encode them. As illustrated in Section 4 below, the run-length based algorithms [27] need to store the header, where more bits are needed for larger values. Bit-packing algorithms [35] are similarly affected. Besides, when most values are negative, bit-packing based algorithms perform badly since the sign bits are 1. To this end, we employ the mean, variance and spread (maximum minus minimum) of the values in time series TS, denoted by Mean(TS), Var(TS) and Spread(TS), to represent the scale features.

2.2 Delta

The delta features show the amplitude of data fluctuations, particularly important to time series. Let DS = [v2 - v1, v3 - v2, ..., vn - v(n-1)] denote the delta series of the time series TS, measuring the deltas of time-adjacent values. The differential-based algorithms [40] introduced in Section 4.1 store these deltas. In this sense, we use Mean(DS), Var(DS) and Spread(DS), the mean, variance and spread (maximum minus minimum) of deltas, to evaluate how large the deltas could be. It is worth noting that, to some extent, Var(TS) also reflects the delta features, and likewise, Var(DS) captures the delta of deltas, important to some encodings such as TS_2DIFF discussed in Section 4.1.

2.3 Repeat

Repetitive values are widely observed in time series, such as an unchanged temperature reading over several minutes. Such consecutive repeats can be compressed by run-length based algorithms [27] in Section 4.2. They also produce zero values under the XOR operators introduced in Section 4.1, shrinking the space efficiently. To this end, we introduce a method to describe the repeats of a time series. The main idea is to count the number of consecutively repeated values within each interval of consecutive repetitive values. We define RS = [r1, r2, ..., rn], the repeat count series of time series TS, having

    r_i = r_(i-1) + 1,  if v_i = v_(i-1),
    r_i = 1,            otherwise (v_i ≠ v_(i-1)),    (1)

for 1 < i ≤ n and r1 = 1. Algorithms like SPRINTZ [20] in Section 4.3 have a block size of 8 for bit-packing numbers into integer bytes.
Therefore, we are interested in the values that repeat more than 8 times. The repeat count measure Count(RS) is thus

    Count(RS) = |{r_i | r_i ≥ 8, 8 ≤ i ≤ n}|.

2.4 Increase

While repetitive values have difference 0, the sign of the difference for non-repetitive values is also of concern. The reason is that the non-zero sign bits may interfere with encoding in algorithms like RLBE [41] introduced in Section 4.3. If all the difference signs are positive, in other words, the time series values are always increasing, the encoding performs better. In contrast, when the differential value is negative, i.e., decreasing, the encoding performance would be bad. In this sense, we define Count(IS), the number of increasing values with adjacent timestamps,

    Count(IS) = |{v_i | v_i > v_(i-1), 1 < i ≤ n}|.    (2)
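To make the feature definitions concrete, the following is a minimal Python sketch computing the Table 1 features from a value list; the function names are ours for illustration and are not the API of the data profiling tools [7].

    # A sketch of the Table 1 features; helper names are ours, not the
    # IoTDB data profiling tools [7].
    def series_features(ts, block=8):
        n = len(ts)
        ds = [ts[i] - ts[i - 1] for i in range(1, n)]        # delta series DS
        def mean(xs): return sum(xs) / len(xs)
        def var(xs):
            m = mean(xs)
            return sum((x - m) ** 2 for x in xs) / len(xs)
        def spread(xs): return max(xs) - min(xs)
        rs = [1] * n                                         # repeat counts, Equation (1)
        for i in range(1, n):
            rs[i] = rs[i - 1] + 1 if ts[i] == ts[i - 1] else 1
        return {
            "Mean(TS)": mean(ts), "Var(TS)": var(ts), "Spread(TS)": spread(ts),
            "Mean(DS)": mean(ds), "Var(DS)": var(ds), "Spread(DS)": spread(ds),
            # Count(RS): positions whose run length reaches the block size 8
            "Count(RS)": sum(1 for i in range(block - 1, n) if rs[i] >= block),
            # Count(IS): strict increases between adjacent points, Equation (2)
            "Count(IS)": sum(1 for i in range(1, n) if ts[i] > ts[i - 1]),
        }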

In addition to the features on scale, delta, repeat and increase, data type is also an important factor that affects the encoding performance. For INT32 and INT64, similar values have smaller deltas than those of FLOAT and DOUBLE. Moreover, the longer INT64 and DOUBLE may have more 0 bits, where bit compacting strategies may perform well. Therefore, in the qualitative analysis in Table 4 and the quantitative evaluation such as Figure 9, the data types are also considered as important data features.

3 TEXT DATA FEATURES

Similarly, text time series data has several data features which may be related to encoding performance, including the distribution of values, the domain of values, the average length of text values and consecutive repeats of characters. Table 2 outlines the major text features.

Table 2: Text data features

Category  | Notation     | Feature
Value     | Exponent(TS) | Exponent of Zipfian distribution
Value     | Domain(TS)   | Domain size of text values
Character | Length(TD)   | Length of text value
Character | Repeat(TD)   | # consecutive character repeats

3.1 Value

The text values often follow a Zipfian distribution [18, 45] in practice. The exponent of the Zipfian distribution represents the frequency of values. The larger the exponent is, the larger the skewness of the value frequency is. Such skewness affects the performance of HUFFMAN encoding [29, 36], which relates to value frequency. Moreover, the domain size of text values is also important, e.g., to DICTIONARY encoding [44], which stores the value domain as a dictionary.

3.2 Character

The character features could affect the encoding algorithms that encode data at the character level. The length of values is of course a key factor. Longer values generally lead to larger character encoding results. Again, repeats of characters also affect the encoding performance, such as in RLE [27] and HUFFMAN [29, 36].

4 NUMERICAL DATA ENCODING

Referring to the aforesaid discussions on the lossless requirement and system architecture, we introduce six encoding algorithms that are proper to implement in Apache IoTDB, including TS_2DIFF [8], GORILLA [37], SPRINTZ [20], RLE [27], RLBE [41], and RAKE [21]. The source code of the implementation is available in the GitHub repository of Apache IoTDB [5]. Table 3 lists the common ideas that may be shared among different encoding algorithms. Specifically, a qualitative analysis of encoding effectiveness with regard to various data features is presented in Table 4.

Table 3: Properties of numerical data encoding algorithms

Encoding | First value | # Repeat | RLE-based | Diff-based
TS_2DIFF | X           |          |           | X
GORILLA  | X           |          |           | X
RAKE     |             |          | X         |
RLE      |             | X        | X         |
RLBE     | X           | X        | X         | X
SPRINTZ  | X           |          | X         | X

4.1 Differential-based Encoding

Differential encoding proposes to reduce the absolute values when the data in a time series is continuous, especially when the original data is large. The number of significant bits is reduced since the absolute value is decreased, which reduces storage costs. Thereby, the compression ratio has an important relationship with the delta features in differential encoding algorithms. While traditional differential encoding can only perform well on monotonous integer values, GORILLA and TS_2DIFF are two recent advances.

4.1.1 TS_2DIFF. The TS_2DIFF encoding is a variant of delta-of-delta [37]. It consists of three steps: delta encoding, second delta encoding, and bit-packing. The first step calculates the delta of every value by subtracting the previous value from the current one. Note that the first value does not have a previous value and should be stored directly. Then, the algorithm finds the minimum delta, min_diff, and gets the final data to store by subtracting min_diff from each delta. Finally, the leading zeros of the fixed-length binary data are removed to get the final encoded byte stream.

[Figure 2: Examples of TS_2DIFF encoding algorithm. Values 2 4 6 7 6 8 7 8 ... are delta encoded as 2 2 2 1 -1 2 -1 1 ...; with min_diff = -1, the second delta encoding yields 3 3 3 2 0 3 0 2 ..., which is bit-packed as 11 11 11 10 00 11 00 10 ..., with the first value stored separately.]
Table 4: A qualitative analysis of encoding effectiveness with regard to various numerical data features

Encoding | INT32 | INT64 | FLOAT | DOUBLE | Large value mean | Large value variance | Large delta mean | Large delta variance | Vast repeats | Vast increases
TS_2DIFF | X     | X     |       |        | X                |                      | X                | ×                    | X            | ×
GORILLA  | X     | X     |       |        |                  | ×                    |                  |                      |              |
RAKE     | ×     | X     | ×     | X      | ×                |                      |                  |                      |              |
RLE      | X     | X     |       |        | ×                |                      |                  |                      | X            |
RLBE     | X     | X     |       |        |                  |                      |                  | ×                    | X            | X
SPRINTZ  | X     | X     |       |        |                  | ×                    | ×                | ×                    | X            |

X good performance, (blank) no preference, × bad performance

Therefore, the smaller the variance and delta variance of the sequence, the smaller the bit-width of the differences of the sequence, and the smaller the final compression ratio. As summarized in Table 4, TS_2DIFF is also suitable for large delta mean values, since the values to store can be made small by subtracting the large minimum in the second delta encoding process.

Figure 2 shows a case of small delta variance, i.e., all the deltas in the second delta are small and thus have a lower space cost.
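As an illustration of the three steps, the following is a minimal Python sketch of TS_2DIFF block encoding on the series of Figure 2; it assumes a single block and a simple binary layout, and is not the Apache IoTDB implementation [8].

    # A sketch of TS_2DIFF on the series of Figure 2; illustration only,
    # not the Apache IoTDB implementation [8].
    def ts_2diff_encode(values):
        first = values[0]                                      # first value, stored directly
        deltas = [b - a for a, b in zip(values, values[1:])]   # delta encoding
        min_diff = min(deltas)
        second = [d - min_diff for d in deltas]                # second delta encoding
        width = max(second).bit_length() or 1                  # bit-packing width of the block
        packed = "".join(format(x, f"0{width}b") for x in second)
        return first, min_diff, width, packed

    # [2, 4, 6, 7, 6, 8, 7, 8] -> deltas [2, 2, 2, 1, -1, 2, -1, 1],
    # min_diff = -1, second deltas [3, 3, 3, 2, 0, 3, 0, 2],
    # packed with width 2 as 11 11 11 10 00 11 00 10, matching Figure 2.
    print(ts_2diff_encode([2, 4, 6, 7, 6, 8, 7, 8]))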
4.1.2 GORILLA. The GORILLA encoding was originally designed for Facebook's time series database (TSDB) [37]. First, it processes the timestamps with a second order differential, which is effective when the values come at an almost fixed interval. The values are divided into four areas by significant bit width. Then it writes the packed timestamps and values. For the values it uses the XOR coding method. Typically, this procedure results in many leading and trailing zeros for float numbers. If the XOR result is zero, it only writes a bit '0' to represent it. Otherwise, it writes the differing bits and the numbers of leading/trailing zeros of the result.

As shown in Table 4, GORILLA is suitable for small variance data, as this increases the number of leading and trailing zeros in the XOR results. On the other hand, it may fail on time series with drastic changes, as more non-zero bits are used to encode the values.

In Figure 3, GORILLA shows a good performance by compressing 160 bits of 5 INT32 values into 66 bits. The time series data has a small variance and lots of leading and trailing zeros.

[Figure 3: Examples of GORILLA encoding algorithm. The INT32 values 11 11 10 12 8 ... (last bytes 00001011 00001011 00001010 00001100 00001000) are XORed with the previous value, giving 00001011 00000000 00000001 00000110 00000100; with the first value stored in full, the encoded bit-widths are 32, 1, 2+5+6+1, 2+5+6+2 and 2+2, compressing 160 bits into 66 bits.]
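The XOR step can be sketched as follows for 32-bit integers; the control-bit layout here is a simplification of the Gorilla format [37] (timestamps and the four-area packing are omitted), so the exact bit counts differ from Figure 3.

    # A sketch of GORILLA-style XOR value encoding for 32-bit integers;
    # the control-bit layout is simplified from [37] for illustration.
    def gorilla_xor_encode(values, width=32):
        bits = format(values[0] & (2**width - 1), f"0{width}b")  # first value in full
        prev = values[0]
        for v in values[1:]:
            x = (v ^ prev) & (2**width - 1)
            if x == 0:
                bits += "0"                    # repeated value: a single '0' bit
            else:
                body = format(x, "b")          # significant bits of the XOR result
                lead = width - len(body)       # number of leading zeros
                # '1' control bit, 5-bit leading-zero count, 6-bit length, the bits
                bits += "1" + format(lead, "05b") + format(len(body), "06b") + body
            prev = v
        return bits

    # A small variance gives many zero XOR results, hence many 1-bit entries.
    print(len(gorilla_xor_encode([11, 11, 10, 12, 8])))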

4.2 Run-length-based Encoding

Run-length encoding (RLE) [27] targets reducing the space cost of adjacent repeating values. The compression effect of the traditional RLE algorithm is limited when the repeat times of values are small. RLE with bit-packing and RAKE [21] are more effective advances.

4.2.1 RLE with bit-packing. RLE encoding [27] stores one element and its consecutive repeating time instead of repeating the same element over and over. For example, a series 444556666 can be stored as 435264 with run-length, where the number 3 after 4 denotes that 4 repeats 3 times. RLE introduces extra space cost to store the repeat times when values rarely repeat; thus the IoTDB implementation combines RLE with bit-packing. Run-length is only applied to values whose repeat time is larger than 8; simple bit-packing is applied to the others.

It is not surprising that RLE with bit-packing performs well when the time series has vast repeats, as more values can be encoded into one value and its repeat time. Bit-packing reduces the storage cost caused by data with few repeats. The algorithm performs better when the repeat rate is high and the value mean is low but positive.

For example, in Figure 4, the series has many consecutive repeated values which are also small positive numbers. Thereby, RLE with bit-packing performs well.

[Figure 4: Examples of RLE encoding algorithm. The series 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 ... is run-length encoded as [8] 3 [9] 5 ..., where [8] and [9] are the repeat times, and then bit-packed with max bit-width 3 as the header 3 followed by [8] 011 [9] 101 ....]
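A minimal sketch of the combined scheme described above, treating runs of at least 8 as run-length-encoded as in Figure 4; the exact IoTDB header layout is not reproduced.

    # A sketch of RLE with bit-packing: long runs become (count, value)
    # pairs, short runs fall back to bit-packing; the exact IoTDB header
    # layout is not reproduced.
    def rle_bitpack_encode(values, threshold=8):
        out, i = [], 0
        width = max(max(values).bit_length(), 1)        # max bit-width for packing
        while i < len(values):
            j = i
            while j < len(values) and values[j] == values[i]:
                j += 1                                   # extend the current run
            if j - i >= threshold:
                out.append(("run", j - i, values[i]))    # long run: repeat time + value
            else:
                packed = "".join(format(v, f"0{width}b") for v in values[i:j])
                out.append(("packed", packed))           # short run: bit-packed values
            i = j
        return width, out

    # [3]*8 + [5]*9 + [2, 4]: the threes and fives become runs as in
    # Figure 4, the trailing 2 and 4 are simply bit-packed with width 3.
    print(rle_bitpack_encode([3] * 8 + [5] * 9 + [2, 4]))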
4.2.2 RAKE. The RAKE encoding [21] is based only on bit counting operations. It considers a T-teeth rake to process T bits at a time. If all the T bits are zeros, a setting bit of 0 is stored. Otherwise, it first stores a setting bit of 1. Then, a codeword of L = ⌈log2 T⌉ bits is generated according to the bits in the rake. The codeword records the position of the first 1, p_first, in binary notation. And, the rake shifts p_first + 1 bits to the right.

Therefore, we expect the '1's of the binary numbers to be sparse, so that T zeros can be compressed into one '0'. For INT64 data, there are more leading zeros, and the data will be compressed more efficiently than INT32 data, as summarized in Table 4.

Figure 5 shows a simple example of how values are encoded by the RAKE algorithm. Since the first 20 bits will obviously be encoded to five zeros, the process of compressing the first 20 bits is not shown. For the number in Figure 5, a sparse number N = [010000010000] is compressed by the RAKE algorithm (with T = 4) to produce a compressed sequence of 8 bits, [10101010].

[Figure 5: Examples of RAKE encoding algorithm. For N = 1040 = [010000010000] with T = 4: rakes I and III each contain a first '1' at position 1 and are encoded as the setting bit 1 followed by the codeword 01; rakes II and IV are all zeros and are encoded as a single 0; the encoded data is [10101010].]
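The rake procedure can be sketched over a bit string as follows, following the description above; it reproduces the [10101010] output of Figure 5.

    # A sketch of RAKE (T-teeth rake over a bit string), following the
    # description above; reproduces the Figure 5 example.
    import math

    def rake_encode(bits, T=4):
        L = math.ceil(math.log2(T))            # codeword length
        out, pos = "", 0
        while pos < len(bits):
            window = bits[pos:pos + T].ljust(T, "0")
            if "1" not in window:
                out += "0"                     # all zeros: a single setting bit 0
                pos += T                       # the rake moves T bits
            else:
                p_first = window.index("1")    # position of the first 1
                out += "1" + format(p_first, f"0{L}b")
                pos += p_first + 1             # the rake shifts p_first + 1 bits
        return out

    # N = 1040 = 010000010000 in 12 bits, compressed to 8 bits as in Figure 5.
    print(rake_encode(format(1040, "012b")))   # -> 10101010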
(delta encoding or Fast Integer REgression encoding) to estimate
4.3 Hybrid Encoding

While the differential-based and run-length-based encoding algorithms can perform well in different scenarios, there are certain cases with both small delta features and vast repeats. To this end, hybrid encoding combining both ideas can achieve a better result, such as RLBE [41] and SPRINTZ [20].

4.3.1 RLBE. The RLBE encoding [41] proposes to combine delta, run-length and Fibonacci based encoding ideas. It has five steps: differential coding, binary encoding, run-length, Fibonacci coding [42] and concatenation. Specifically, delta encoding is first applied to the original data (integers of 32 bits), and the lengths of the differential values (in binary notation) are calculated. Run-length is then applied to the length codes. In the concatenation phase, the first 5 bits represent the length of the binary words (the length is encoded in a binary word), followed by the Fibonacci code word of the repeat time of the length code, and sequentially followed by the binary code words of the differential values with the same length.

As summarized in Table 4, when the differential values are positive and small, RLBE performs well. RLBE performs badly when a differential value is negative, as the sign bit is '1' and no leading '0's can be abolished. When adjacent differential values are of different orders of magnitude, i.e., the variance is large, RLBE also performs badly, as run-length on the length codes cannot be applied.

The examples are shown in Figure 6. Since the values are all increasing in the time series data and all the deltas are positive, RLBE has a good performance in this example.

[Figure 6: Examples of RLBE encoding algorithm. Values 1 2 3 7 10 ... → differential coding 1 1 1 4 3 ... → binary encoding 1 1 1 100 11 ... → lengths 1 1 1 3 2 ... → run-length 3 1 1 ... → Fibonacci encoding 0011 11 11 ... → concatenation of lengths, repeat times and data: 00001 0011 1 1 1 00011 11 100 00010 11 11 ....]

To enlarge the encoding range, we extend the first 5 bits representing the binary code length to 6 bits, as shown in Figure 7. The reason is that when a differential value is negative, it has 32 meaningful bits, exceeding the representation range of 5 bits. Likewise, when supporting integers of 64 bits, we expand the length representing the binary code to 7 bits for the same reason.

[Figure 7: Examples of INT32 and INT64 extensions for RLBE.]

4.3.2 SPRINTZ. The SPRINTZ encoding [20] combines encodings in four steps: predicting, bit-packing, run-length encoding and entropy encoding. In the first step, it uses a predictive function (delta encoding or Fast Integer REgression encoding) to estimate the next coming value. Then it encodes the difference between the actual value and the predicted one. Typically, this step shrinks the absolute values to be encoded. Next, it bit-packs a block of the residuals obtained in the first step. The largest number of significant bits in the block is written in the header, and the leading zeros are trimmed. Following that, run-length encoding and entropy encoding (e.g., Huffman coding) are applied to reduce redundancy. Run-length coding compresses the consecutive zero blocks by recording the number of zeros, and entropy coding compresses the headers and payloads by encoding bytes in the form of Huffman coding.

As summarized in Table 4, the SPRINTZ algorithm is suitable for predictable time series. For the delta function, vast repeats or linearly increasing time series are the best targets. For the FIRE (Fast Integer REgression) predictor, a constant slope is the best fit. Since the example in Figure 8 has small value variance and delta mean, the SPRINTZ encoding has a good performance.

[Figure 8: Examples of SPRINTZ encoding algorithm. Values 2 4 6 7 6 8 7 8 ... → delta encoding 2 2 2 1 -1 2 -1 1 ... → zigzag encoding 3 3 3 1 2 3 2 1 ... → bit-packing 11 11 11 01 10 11 10 01 ..., with the first value stored separately.]
Table 5: A qualitative analysis of encoding effectiveness with regard to various text data features

Encoding   | Large exponent | Large domain | Large length | Vast repeats
HUFFMAN    | X              | ×            | ×            | X
DICTIONARY |                | ×            | X            |
RLE        |                |              | X            | X

X good performance, (blank) no preference, × bad performance

Table 6: Parameters of the numerical data generator for various data features

Notation | Data Feature       | Range
μv       | Mean of values     | [-5 × 10^4, 5 × 10^4]
μd       | Mean of deltas     | [-2000, 2000]
σd       | Variance of deltas | [0, 1000]
γ        | Repeat rate        | [0, 1]
η        | Increase rate      | [0, 1]

5 TEXT DATA ENCODING

In this section, we introduce 3 text encoding algorithms, including DICTIONARY [44], HUFFMAN [29, 36] and RLE [27]. Table 5 presents a qualitative analysis of encoding effectiveness with regard to various text data features.

5.1 DICTIONARY Encoding

The DICTIONARY algorithm [44] looks up each value in the dictionary. If the value is found successfully, the value is replaced by its key in the dictionary; otherwise, the algorithm adds a new pair of key and value to the dictionary. For example, if the map in the dictionary is {1:True, 2:False}, the time series TS = {True, False, True, True} could be encoded as 1211. Obviously, a large domain leads to higher cost in DICTIONARY encoding. In contrast, DICTIONARY favors large length values, by encoding each of them to a short key.

5.2 Run Length Encoding

Run-Length Encoding (RLE) [27] performs especially well for data with strings of repeated characters (such a string is called a run). The main idea of the algorithm is to encode repeated characters as a pair of the length of the repeated characters and the character. For example, the value 'abbaaaaabaabbbaa' of length 16 bytes is represented as '1a2b5a1b2a3b2a'. However, if there are no repeated characters in the value, the size of the output data can be twice as large as the size of the input data.

5.3 HUFFMAN Encoding

The HUFFMAN encoding algorithm [29, 36] decreases the overall length of the data by assigning shorter codewords to the more frequently occurring characters, employing a strategy of replacing fixed length codes (such as ASCII) by variable length codes. It creates a uniquely decipherable prefix-code, precluding the need for a separator to determine codeword boundaries. For data with many high frequency values in a skewed data distribution and many repeated characters, HUFFMAN's ability to shorten the encoding of high frequency characters performs well.

6 ENCODING BENCHMARK

To evaluate the encoding algorithms, we extend IoTDB-Benchmark [9], developed in our previous study, by introducing an advanced data generator for various time series data features. Some real-world data collected by our industrial partners are also employed, together with the necessary metrics for evaluation.

6.1 Synthetic Numerical Data

To evaluate the effect of encoding algorithms working on different data features, we design a data generator for varying data features, controlled by 5 parameters. Table 6 lists the parameters, generally analogous to the data features in Section 2. The parameter μv controls the mean of values Mean(TS) in Table 1. For each data point, we employ a normal distribution with μd and σd to determine its delta to the previous value, i.e., analogous to the mean of deltas Mean(DS) and variance of deltas Var(DS) in Table 1. Since the point has already been determined by this delta and its previous value, we are not able to further control the variance or spread of the values. Nevertheless, they are related to the deltas to a certain extent as discussed. Repeat rate γ is the probability of generating a series of consecutive points with repeated values, analogous to the repeat count Count(RS) defined in Section 2.3. Increase rate η denotes the probability of generating a point with a value greater than the previous one, for the increase count feature Count(IS) in Section 2.4.

Algorithm 1 presents the pseudo-code of the data generator. Let DS denote the delta series, as introduced in Section 2.2, to generate. Lines 3-6 generate repeats with probability γ specified in the parameters. Likewise, Lines 8-12 generate an increase point with probability η. The delta is given by a normal distribution with parameters μd and σd. Finally, the delta series DS is transformed to TS by a prefix summation, and all the values are zoomed to the target value mean μv.

Algorithm 1: Numerical data generator

Data: μv, μd, σd, γ, η, length n
Result: TS
 1  DS := empty_list();
 2  while |DS| < n do
 3      isRepeat := random_index(γ);
        /* probability of isRepeat == 1 is γ, of isRepeat == 0 is 1 - γ */
 4      if isRepeat then
 5          repeat_len := random(8, T);    /* a random number in (8, T] */
 6          DS.append(0, repeat_len);      /* append 0 for repeat_len times */
 7      else
 8          isPositive := random_index(η);
 9          delta := 0;
10          if isPositive then
11              while delta ≤ 0 do
12                  delta := random_gauss(μd, σd);
13              end
14          else
15              while delta ≥ 0 do
16                  delta := random_gauss(μd, σd);
17              end
18          end
19          DS.append(delta);
20      end
21  end
22  TS := prefix_sum(DS);
23  TS.zoom(μv);    /* zoom TS to adjust the mean to μv */
24  return TS;
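For reference, the following is a runnable Python sketch of Algorithm 1; interpreting zoom() as shifting the series to the target mean μv, and exposing the run-length bound T as a parameter, are assumptions of this sketch.

    # A runnable sketch of Algorithm 1; zoom() is interpreted as shifting
    # the series to the target mean, an assumption on our part.
    import random

    def generate_ts(mu_v, mu_d, sigma_d, gamma, eta, n, T=100):
        ds = []
        while len(ds) < n:
            if random.random() < gamma:                # Lines 3-6: a run of repeats
                ds += [0] * random.randint(9, T)       # repeat_len in (8, T]
            else:                                      # Lines 8-19: one signed delta
                sign = 1 if random.random() < eta else -1
                delta = 0
                while delta * sign <= 0:               # resample until the sign matches
                    delta = random.gauss(mu_d, sigma_d)
                ds.append(delta)
        ts, acc = [], 0
        for d in ds[:n]:                               # Line 22: prefix summation
            acc += d
            ts.append(acc)
        shift = mu_v - sum(ts) / n                     # Line 23: zoom to mean mu_v
        return [v + shift for v in ts]

    print(generate_ts(0, 10, 5, 0.3, 0.7, 20)[:5])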
Table 7: Real-world numerical datasets

Dataset      | Public | # data points | # time series
MSRC-12 [10] | yes    | 17,059        | 10
UCI-Gas [11] | yes    | 189,981       | 19
WC-Vehicle   |        | 79,992        | 8
TH-Climate   |        | 1,317,330     | 140
CW-AIOps     |        | 2,215,599     | 224
CS-Ship      |        | 89,991        | 9
TY-Carriage  |        | 9,680,088     | 450
WH-Chemistry |        | 44,622        | 54
CR-Train     |        | 859,914       | 86
CB-Engine    |        | 533,901       | 88

6.2 Real-world Numerical Data

Real-world datasets, public or collected by our industrial partners, are also included in the benchmark, for learning the data features and evaluating the encoding algorithms. Table 7 reports the major statistics of the prepared datasets. Figure 12(b) below illustrates their data features.

MSRC-12 [10] is a dataset with float values from Microsoft Kinect gestures and has a low repeat rate and small delta variance due to small fluctuation. UCI-Gas [11] also consists of float values, for measuring gas concentration during chemical experiments, and has a low delta mean. WC-Vehicle contains sensor readings for monitoring vehicles and has a low repeat rate. TH-Climate is a dataset of weather information collected on the Tsinghua campus. It has a low delta mean and a high repeat rate. CW-AIOps is a dataset of application performance monitoring (APM) in cloud services, where the mean, variance and spread of both value and delta are very large due to the complex application scenarios. CS-Ship monitors the status of ship engines. Its value mean and delta mean are small while the increase rate is high. TY-Carriage contains readings of carriage monitoring sensors, and it has a low delta mean. WH-Chemistry is a dataset from a chemical plant. It has high value mean, value variance, value spread, delta mean, delta variance and delta spread. CR-Train is a dataset from a metro system, with low delta mean and high repeat rate. CB-Engine consists of sensor readings in a concrete mixer. It has low delta mean, delta variance and repeat rate.

6.3 Synthetic Text Data

The text data generator considers 4 parameters in Table 8, corresponding to the 4 features listed in Table 2 in Section 3. The exponent θv determines the skewness of the value distribution. Nv is the domain size of text values. These two parameters decide the distribution of values. Moreover, ℓc represents the average length of all the text values, and the repeat rate γc is the probability of generating consecutive character repeats.

Table 8: Parameters of the text data generator for various text data features

Notation | Text Data Feature              | Range
θv       | Exponent of value distribution | [0, 10]
Nv       | Domain size of text values     | [1, 1500]
ℓc       | Average length of text value   | [100, 1100]
γc       | Repeat rate                    | [0.9, 1]

Algorithm 2 shows the pseudo-code of the text data generator. Let TS be the time series and TD be the value domain used to generate TS. Lines 2-16 generate TD with domain size Nv, given the value length ℓc and character repeat rate γc. Then, lines 17-23 generate the distribution of values, under a Zipfian distribution with exponent θv.

6.4 Real-world Text Data

Table 9 presents several real-world text time series datasets. CW-AIOps is a log dataset of application performance monitoring (APM) in cloud services, collected by our industrial partners. Web Server Access Logs, Incident Event Log Dataset and Web Log Dataset are public datasets on Kaggle [15]. Among them, Web Server Access Logs contains information on any event that was registered/logged. Incident Event Log Dataset is an event log dataset of a website. Web Log Dataset is the server log dataset of RUET OJ.

6.5 Evaluation Metrics

To measure the performance of encoding algorithms for time series, two aspects are considered: the compression ratio in space cost, and the encoding and decoding time cost.

6.5.1 Compression Ratio. It measures the ratio of the compressed (encoded) data size to the uncompressed (non-encoded) data size,

    compressionRatio = compressedSize / uncompressedSize.

6.5.2 Time Cost. We report two aspects of time cost. The insert time measures the total cost of inserting a time series, including adding to the memTable, flushing from memory to disk with sorting, encoding, and compressing. The select time measures the total cost of querying a time series, with decompressing and decoding.
Algorithm 2: Text data generator

Data: θv, Nv, ℓc, γc, length n
Result: TS
 1  TS := empty_list();
 2  TD := new string[Nv];
 3  for i ∈ [0, Nv] do
 4      for j ∈ [0, ℓc] do
 5          if j = 0 then
 6              TD_i.append(rand_Char());
 7              continue;
 8          end
 9          isRepeat := random_index(γc);
            /* probability of isRepeat == 1 is γc, of isRepeat == 0 is 1 - γc */
10          if isRepeat then
11              TD_i.append(TD_i.back());
12          else
13              TD_i.append(rand_Char_except(TD_i.back()));
14          end
15      end
16  end
17  num := {num_0, ..., num_(Nv-1)}, where num_i = ( (1/(i+1))^θv / Σ_{j=0}^{Nv-1} (1/(j+1))^θv ) · n;
18  for i ∈ [0, Nv] do
19      for j ∈ [0, num_i] do
20          TS.append(TD_i);
21      end
22  end
23  TS := random_permutation(TS); return TS;

Table 9: Real-world text datasets

Dataset                     | Public | # points   | # series
CW-AIOps                    |        | 2,215,599  | 224
Web Server Access Logs [12] | yes    | 10,365,152 | 1
Incident Event Log [13]     | yes    | 141,712    | 35
Web Log [14]                | yes    | 258,441    | 4

7 EXPERIMENTAL EVALUATION

In this section, we conduct the experimental evaluation of the encoding algorithms analyzed in Section 4. Note that encoding and compressing are complementary. Therefore, we study PLAIN (no encoding), TS_2DIFF [8], GORILLA [37], SPRINTZ [20], RLE [27], RLBE [41], and RAKE [21] encoding, combined with the NONE (no compression), SNAPPY [38], GZIP [16] or LZ4 [19] compressors. We present their performances over various data types including INT32, INT64, FLOAT and DOUBLE as listed in Table 4. With the help of the benchmark in Section 6, the results on both the real-world and synthetic data with various features are reported.

The experiments run on a machine with a 2.1 GHz Intel(R) Xeon(R) CPU and 128 GB RAM. While the source code of the encoding algorithms has been deployed in the GitHub repository of Apache IoTDB [5], the experiment related code and data are available in [6].

7.1 Real-world Numerical Data Evaluation

Figures 9, 10 and 11 report the compression ratio, insert time and select time defined in Section 6.5, respectively. The experiments are conducted over all 1088 time series in Table 7. In the boxplot charts, every element represents one time series. We study 28 combinations of 7 encoding schemes and 4 compression schemes on 4 different data types. In Figure 9, the lower the metrics are, the better the performance is. Since the compression ratios are the same in different runs, we do not repeat the experiments of compression ratio. The experiments on time cost are repeated 50 times.

[Figure 9: Compression ratio over all numerical datasets, for (a) INT32, (b) INT64, (c) FLOAT and (d) DOUBLE, under NONE, SNAPPY, LZ4 and GZIP compression.]

7.1.1 Comparison. As illustrated in Figure 9, TS_2DIFF encoding achieves a good (low) compression ratio, with or without compression. RAKE encoding performs even worse than PLAIN (no encoding) when handling INT32 and FLOAT.

As introduced in Section 4.2.2, when there are many consecutive 0 bits, RAKE performs well. The more 1 bits there are, the worse the RAKE algorithm performs. For INT32 and FLOAT, since there are fewer 0 bits than in INT64 and DOUBLE for the same values, the performance of RAKE is worse. In particular, for negative numbers with a small absolute value, owing to the leading sign bit '1' and more leading '1's, RAKE may perform even worse given the extra cost of setting bits and so on. Similar results are also observed in Figure 14(a), where RAKE again performs worse than PLAIN on negative numbers (with value mean less than 0).

GORILLA performs better on INT32 and INT64 than on FLOAT and DOUBLE, since the positions of leading and trailing 0 are more similar. The results verify the qualitative analysis on data types in Table 4.

Nevertheless, with compression, the space cost is further reduced. The improvement by compression after TS_2DIFF encoding, however, is limited, with extra compressing and decompressing time.

[Figure 10: Insert time over all numerical datasets, for (a) INT32, (b) INT64, (c) FLOAT and (d) DOUBLE, under NONE, SNAPPY, LZ4 and GZIP compression.]

[Figure 11: Select time over all numerical datasets, for (a) INT32, (b) INT64, (c) FLOAT and (d) DOUBLE, under NONE, SNAPPY, LZ4 and GZIP compression.]

7.1.2 Individual Datasets. In addition to the qualitative analysis on data types, we further validate the effects of the data features analyzed in Table 4. Figure 12(a) reports the compression ratio of the 7 encoding schemes without compression applied (NONE). Figure 12(b) shows the corresponding 8 data features listed in Table 1.

In general, TS_2DIFF encoding still achieves good performance, while RAKE could be worse than PLAIN without encoding, similar to the observations in Figure 11. For the datasets with large delta mean, such as UCI-Gas, TH-Climate, MSRC-12, CS-Ship and TY-Carriage, TS_2DIFF still performs well. For the datasets with small value variance and delta mean, such as WC-Vehicle and CB-Engine, GORILLA performs better. RLBE performs better on CS-Ship than on the other datasets, since CS-Ship has relatively smaller delta mean and delta variance. The results verify again the analysis in Table 4.

[Figure 12: Compression ratio and features on each dataset: (a) compression ratio of each dataset, (b) feature value of each dataset, over MSRC-12, UCI-Gas, WC-Vehicle, TH-Climate, CW-AIOps, CS-Ship, TY-Carriage, WH-Chemistry, CR-Train and CB-Engine.]

Note that both the time and value series are encoded and compressed, and the statistics stored in the PageHeader consider both time and value series compression. Since time is encoded and compressed
by default, the compression ratio of PLAIN encoding and NONE compression on the value series together with time is less than 1.

7.1.3 Trade-off. Figure 13 breaks down the time costs into encoding time (ET), decoding time (DT), compression time (CT), uncompression time (UT) and the corresponding compression ratio (CR) of the different encoding algorithms together with the four compression strategies. The experiments run on all the real-world datasets in Table 7 and report the average. For each dimension, we normalize the results into a range between 0 and 1, the larger the better. For ET, DT, CT and UT, a larger value represents a lower time compared with the other encoding algorithms, i.e., more efficient. For compression ratio (CR), a larger metric represents a lower compression ratio, again better compression performance.

[Figure 13: Trade-off between time and compression ratio, for (a) GZIP, (b) LZ4, (c) SNAPPY and (d) NONE, over the dimensions ET, DT, CT, UT and CR.]

As shown, most encoding algorithms are efficient in encoding (ET). TS_2DIFF has a better compression ratio (CR) as well as compression time (CT) and uncompression time (UT), but the corresponding decoding time (DT) is worse. GORILLA, with both better encoding and decoding time (ET and DT), has a worse compression ratio (CR). Similar trade-off results are also observed in Figure 13(d) without compression (NONE).

7.2 Varying Numerical Data Features

While the experiments on real data in Section 7.1 cover only part of the data features analyzed in Table 4, in order to conduct a more extensive quantitative analysis, we further evaluate the encoding algorithms over the synthetic data with various features controlled by the data generator presented in Algorithm 1. Again, we compare the 7 encoding schemes without compression (NONE). Figures 14-18 report the results on INT32. Due to the limited space, the similar results on other setting combinations are omitted.

[Figure 14: Varying scale feature of value mean μv: (a) compression ratio, (b) insert time, (c) select time.]

[Figure 15: Varying delta feature of delta mean μd: (a) compression ratio, (b) insert time, (c) select time.]

[Figure 16: Varying delta feature of delta variance σd: (a) compression ratio, (b) insert time, (c) select time.]

[Figure 17: Varying repeat feature of repeat rate γ: (a) compression ratio, (b) insert time, (c) select time.]

7.2.1 Compression Ratio. Figure 14(a) varies the scale feature of value mean μv introduced in Table 6. RAKE and RLE perform better when the value mean is positive. For a negative value mean, the sign bits are 1, i.e., compression cannot be applied on the first 4 bits of
leading zeros. TS_2DIFF and RLBE are less affected by the value mean, since run-length and differential encoding store the lengths of repeat times and the differential values instead of storing lots of large values. GORILLA is unstable, as its performance is affected by XOR, unrelated to the value mean.

For the delta features, Figure 15(a) varies the delta mean μd and Figure 16(a) considers the delta variance σd. TS_2DIFF performance decreases with the increase of delta variance in Figure 16(a), which is not surprising referring to the analysis in Table 4.

When the repeat rate γ increases in Figure 17(a), RLE, RLBE and SPRINTZ perform better. They are run-length based algorithms, favoring high repeats. GORILLA also performs better with increasing repeat rates, since XOR produces more zeros and only a bit of '0' needs to be stored. RAKE's performance is not affected because its run-length is on bits, instead of value repeats.

Figure 18(a) varies the increase rate η. When the increase rate becomes larger, more positive values appear, beneficial to bit-packing. Thereby, RLBE is positively correlated with the increase rate.

[Figure 18: Varying increase feature of increase rate η: (a) compression ratio, (b) insert time, (c) select time.]

7.2.2 Time Cost. Analogous to the compression ratio results in Section 7.2.1, Figures 14(b), 15(b), 16(b), 17(b) and 18(b) report the insert time under various value mean μv, delta mean μd, delta variance σd, repeat rate γ and increase rate η, respectively. Likewise, Figures 14(c)-18(c) are the corresponding select time. Due to the limited space, similar results on other setting combinations are omitted. Each test is conducted 50 times and reports the average.

As shown, while the compression ratio is affected largely by the various data features in Figures 14(a)-18(a), the corresponding insert time and select time are stable. Different encoding methods indeed lead to very close insert and select time. Similar to Figures 10 and 11, the insert time is much higher than the select time under the various data features. Due to the extremely low select time, the variances of select time in the tests are relatively larger. The results are generally consistent with those on real datasets in Section 7.1.

7.3 Real-world Text Data Evaluation

Figure 19 reports the compression ratio, insert time and select time defined in Section 6.5, respectively, over all text datasets in Table 9. We perform the 16 combinations of 4 text encoding schemes and 4 general compression schemes on text data.

[Figure 19: Performance of text encoding on real datasets: (a) compression ratio, (b) insert time, (c) select time, under NONE, SNAPPY, LZ4 and GZIP compression.]

As shown in Figure 19, when there is no compression algorithm applied, HUFFMAN has the best performance in compression ratio, but it has the worst performance in time cost. RLE shows an even worse compression ratio than PLAIN (no encoding), owing to the limited repeated characters as analyzed in Section 5.2. When compression is applied, either SNAPPY, LZ4 or GZIP, the DICTIONARY encoding has almost the best compression ratio and time cost.

7.4 Varying Text Data Features

Similar to the numerical data, we generate text data with the different features mentioned in Section 3, by Algorithm 2, and evaluate the encoding performance on the synthetic text data.

For the value features, Figure 20 varies the exponent θv introduced in Table 8. The larger the exponent is, i.e., the more skewed the data distribution is, the better HUFFMAN performs. The improvement of the compression ratio, however, is not significant. The other algorithms are not affected by the exponent, as also analyzed in Table 5.

[Figure 20: Varying text feature of exponent θv: (a) compression ratio, (b) insert time, (c) select time.]

Figure 21 varies the domain size Nv introduced in Table 8. It is not surprising that DICTIONARY performs worse with the increase of the domain size. In contrast, DICTIONARY favors a larger value length as illustrated in Figure 22, with a slight improvement.

[Figure 21: Varying text feature of domain size Nv: (a) compression ratio, (b) insert time, (c) select time.]
[Figure 22: Varying text feature of value length ℓc: (a) compression ratio, (b) insert time, (c) select time.]

[Figure 23: Varying text feature of repeat rate γc: (a) compression ratio, (b) insert time, (c) select time.]

In Figure 23, the compression ratio of RLE significantly improves when the character repeat rate γc is large, as analyzed in Table 5 and Section 5.2. However, such character repeats may not be prevalent in practice, as illustrated in Figure 19(a).

Due to more characters, the insert time significantly increases while the length is increasing. The insert time is almost unchanged with the exponent, domain and repeat rate varying.

Since the HUFFMAN algorithm needs to recover the Huffman tree in the process of selecting data, it has a significantly higher select time. When the repeat rate becomes larger, the Huffman tree becomes smaller, and the select time decreases in Figure 23(c). In contrast, with the increase of length, the Huffman tree becomes larger, and the select time increases in Figure 22(c).

8 RELATED WORK

While this study focuses on encoding methods that are proper to implement in time series database management systems, there do exist many other alternatives (see [22] for a survey).

8.1 Lossless Encoding

In addition to the lossless encoding algorithms studied in this paper, the dictionary-based algorithms [32, 34] are not practical to implement for numerical values, since the dictionaries could be too large to store in the PageHeader of Apache IoTDB. Similarly, machine learning based lossless encoding, such as [47] consisting of a transform stage and an encoding stage, needs to conduct reinforcement learning. Not only are the models too large to store, but the learning is also too heavy to process inside databases.

8.2 Lossy Encoding

While we focus on the lossless encoding required in databases, lossy encoding is also practical in other applications, especially in end or edge devices. Plato [31] proposes to reduce noise. ODH [30] adopts different lossy algorithms to encode the linear and non-linear data, respectively. Eichinger et al. [25] propose to estimate time series in piecewise polynomials by applying a greedy method and three different online regression algorithms, including the PMR-Midrange algorithm [33], an optimal approximation algorithm [24], and a randomized algorithm [39], for approximating constant functions, straight lines, and polynomials. Fink and Gandhi [26] introduce an algorithm to encode time series by exploiting time series maxima and minima. TRISTAN [34] and CORAD [32] use the autocorrelation in one or multiple time series to improve compression ratio and accuracy.

8.3 General Compression

In Apache IoTDB, a compression step for general data is applied after the time series is encoded, i.e., the two are complementary. The compression algorithms implemented in Apache IoTDB, GZIP [16], SNAPPY [38] and LZ4 [19], all originate from LZ77 [48], looking for the longest match string using a sliding window on the input stream. Nevertheless, the results in Figure 9 show that TS_2DIFF encoding is already efficient, while further applying general purpose compression cannot reduce the space cost much further.

9 CONCLUSION

In this paper, we provide both qualitative and quantitative analyses of time series encoding algorithms with regard to various data features. The comparison is conducted in Apache IoTDB, an open-source time series database developed in our preliminary study [43]. First, we profile several features that may affect the performance of encoding. The qualitative and quantitative analyses are thus built on these data features. To evaluate the encoding algorithms, we present a benchmark with real-world data and a data generator for various features.

We notice that different encoding algorithms favor various data features. It motivates us to recommend distinct encoding algorithms referring to the features of different datasets. While some preliminary results of an encoding recommender are presented in Appendix A of [17], it is expected to improve the recommender further. For instance, one may employ more advanced machine learning models to train a more accurate recommender. Incremental and transfer learning could also be applied, to address evolving data features and generalize over never-seen datasets. In addition to pursuing more concise encoding, one may also expect to balance the space cost and the time cost of efficient query processing.

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China (62072265, 62021002), the National Key Research and Development Plan (2021YFB3300500, 2019YFB1705301, 2019YFB1707001), Beijing National Research Center for Information Science and Technology (BNR2022RC01011), and Alibaba Group through the Alibaba Innovative Research (AIR) Program. Shaoxu Song (https://sxsong.github.io/) is the corresponding author.
REFERENCES
[1] https://iotdb.apache.org/.
[2] https://www.influxdata.com/.
[3] http://opentsdb.net/.
[4] https://prometheus.io/.
[5] https://github.com/apache/iotdb/tree/research/encoding-exp.
[6] https://github.com/xjz17/iotdb/tree/TSEncoding.
[7] https://thulab.github.io/iotdb-quality/.
[8] https://iotdb.apache.org/UserGuide/Master/Data-Concept/Encoding.html.
[9] https://github.com/thulab/iotdb-benchmark.
[10] https://www.microsoft.com/en-us/download/details.aspx.
[11] https://archive.ics.uci.edu.
[12] https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs.
[13] https://www.kaggle.com/datasets/winmedals/incident-event-log-dataset.
[14] https://www.kaggle.com/datasets/shawon10/web-log-dataset.
[15] https://www.kaggle.com/datasets/.
[16] https://www.gnu.org/software/gzip/.
[17] https://sxsong.github.io/doc/encoding.pdf.
[18] Anders Aamand, Piotr Indyk, and Ali Vakilian. (Learned) frequency estimation algorithms under Zipfian distribution. CoRR, abs/1908.05198, 2019.
[19] Matej Bartik, Sven Ubik, and Pavel Kubalík. LZ4 compression algorithm on FPGA. In ICECS 2015, pages 179-182. IEEE, 2015.
[20] Davis W. Blalock, Samuel Madden, and John V. Guttag. Sprintz: Time series compression for the internet of things. CoRR, abs/1808.02515, 2018.
[21] Giuseppe Campobello, Antonino Segreto, Sarah Zanafi, and Salvatore Serrano. RAKE: A simple and efficient lossless compression algorithm for the internet of things. In EUSIPCO 2017, pages 2581-2585. IEEE, 2017.
[22] Giacomo Chiarot and Claudio Silvestri. Time series compression: a survey. CoRR, abs/2101.08784, 2021.
[23] E. F. Codd. Relational database: A practical foundation for productivity. Commun. ACM, 25(2):109-117, 1982.
[24] Marco Dalai and Riccardo Leonardi. Approximations of one-dimensional digital signals under the l∞ norm. IEEE Trans. Signal Process., 54(8):3111-3124, 2006.
[25] Frank Eichinger, Pavel Efros, Stamatis Karnouskos, and Klemens Böhm. A time-series compression technique and its application to the smart grid. VLDB J., 24(2):193-218, 2015.
[26] Eugene Fink and Harith Suman Gandhi. Compression of time series by extracting major extrema. J. Exp. Theor. Artif. Intell., 23(2):255-270, 2011.
[27] Solomon W. Golomb. Run-length encodings. IEEE Trans. Inf. Theory, 12(3):399-401, 1966.
[28] Muon Ha and Yulia A. Shichkina. Translating a distributed relational database to a document database. Data Sci. Eng., 7(2):136-155, 2022.
[29] Paul G. Howard and Jeffrey Scott Vitter. Parallel lossless image compression using Huffman and arithmetic coding. Inf. Process. Lett., 59(2):65-73, 1996.
[30] Sheng Huang, Yaoliang Chen, Xiaoyan Chen, Kai Liu, Xiaomin Xu, Chen Wang, Kevin Brown, and Inge Halilovic. The next generation operational data historian for IoT based on Informix. In SIGMOD 2014, pages 169-176. ACM, 2014.
[31] Yannis Katsis, Yoav Freund, and Yannis Papakonstantinou. Combining databases and signal processing in Plato. In CIDR 2015. www.cidrdb.org, 2015.
[32] Abdelouahab Khelifati, Mourad Khayati, and Philippe Cudré-Mauroux. CORAD: correlation-aware compression of massive time series using sparse dictionary coding. In IEEE BigData 2019, pages 2289-2298. IEEE, 2019.
[33] Iosif Lazaridis and Sharad Mehrotra. Capturing sensor-generated time series with quality guarantees. In ICDE 2003, pages 429-440. IEEE Computer Society, 2003.
[34] Alice Marascu, Pascal Pompey, Eric Bouillet, Michael Wurst, Olivier Verscheure, Martin Grund, and Philippe Cudré-Mauroux. TRISTAN: real-time analytics on massive time series using sparse dictionary compression. In IEEE BigData 2014, pages 291-300. IEEE Computer Society, 2014.
[35] V. Krishna Nandivada and Rajkishore Barik. Improved bitwidth-aware variable packing. ACM Trans. Archit. Code Optim., 10(3):16:1-16:22, 2013.
[36] Ghim Hwee Ong and Shell-Ying Huang. A data compression scheme for Chinese text files using Huffman coding and a two-level dictionary. Inf. Sci., 84(1&2):85-99, 1995.
[37] Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin Teller, and Kaushik Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proc. VLDB Endow., 8(12):1816-1827, 2015.
[38] Horst Samulowitz, Chandra Reddy, Ashish Sabharwal, and Meinolf Sellmann. Snappy: A simple algorithm portfolio. In SAT 2013, volume 7962 of LNCS, pages 422-428. Springer, 2013.
[39] Raimund Seidel. Small-dimensional linear programming and convex hulls made easy. Discret. Comput. Geom., 6:423-434, 1991.
[40] Bin Song, Limin Xiao, Guangjun Qin, Li Ruan, and Shida Qiu. A deduplication algorithm based on data similarity and delta encoding. In GRMSE 2016, pages 245-253. Springer, 2016.
[41] Julien Spiegel, Patrice Wira, and Gilles Hermann. A comparative experimental study of lossless compression algorithms for enhancing energy efficiency in smart meters. In INDIN 2018, pages 447-452. IEEE, 2018.
[42] Jirí Walder, Michal Krátký, and Jan Platos. Fast Fibonacci encoding algorithm. In Dateso 2010, volume 567 of CEUR Workshop Proceedings, pages 72-83. CEUR-WS.org, 2010.
[43] Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin Mcgrail, Peng Wang, Diaohan Luo, Jun Yuan, Jianmin Wang, and Jiaguang Sun. Apache IoTDB: Time-series database for internet of things. Proc. VLDB Endow., 13(12):2901-2904, 2020.
[44] Terry A. Welch. A technique for high-performance data compression. Computer, 17(6):8-19, 1984.
[45] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. Mining top-k itemsets over a sliding window based on Zipfian distribution. In SDM 2005, pages 516-520. SIAM, 2005.
[46] Retaj Yousri, Madyan Alsenwi, M. Saeed Darweesh, and Tawfik Ismail. A design for an efficient hybrid compression system for EEG data. In ICEEM 2021, pages 1-6, 2021.
[47] Xinyang Yu, Yanqing Peng, Feifei Li, Sheng Wang, Xiaowei Shen, Huijun Mai, and Yue Xie. Two-level data compression using machine learning in time series database. In ICDE 2020, pages 1333-1344. IEEE, 2020.
[48] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977.

