Data-Aware Adaptive Compression For Stream Processing
Abstract—Stream processing has been in widespread use, and one of the most common application scenarios is SQL query on streams. By 2021, the global deployment of IoT endpoints reached 12.3 billion, indicating a surge in data generation. However, the escalating demands for high throughput and low latency in stream processing systems have posed significant challenges due to the increasing data volume and evolving user requirements. We present a compression-based stream processing engine, called CompressStreamDB, which enables adaptive fine-grained stream processing directly on compressed streams, to significantly enhance the performance of existing stream processing solutions. CompressStreamDB utilizes nine diverse compression methods tailored for different stream data types and integrates a cost model to automatically select the most efficient compression schemes. CompressStreamDB provides high throughput with low latency in stream SQL processing by identifying and eliminating redundant data among streams. Our evaluation demonstrates that CompressStreamDB improves average performance by 3.84× and reduces average delay by 68.0% compared to the state-of-the-art stream processing solution for uncompressed streams, along with 68.7% space savings. Besides, our edge trials show an average throughput/price ratio of 9.95× and a throughput/power ratio of 7.32× compared to the cloud design.

Index Terms—Data compaction and compression, stream processing, edge computing.

Manuscript received 7 June 2023; revised 26 February 2024; accepted 4 March 2024. Date of publication 19 March 2024; date of current version 7 August 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62322213 and Grant 62172419, and in part by Beijing Nova Program under Grant 20220484137 and Grant 20230484397. Recommended for acceptance by S. Salihoglu. (Corresponding author: Feng Zhang.)

Yu Zhang, Feng Zhang, Hourun Li, Xiaoguang Guo, and Xiaoyong Du are with the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), School of Information, Renmin University of China, Beijing 100872, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Shuhao Zhang is with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).

Yuxing Chen and Anqun Pan are with the Database R&D Department, Tencent Inc., Shenzhen 518000, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TKDE.2024.3377710

1041-4347 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

THE contemporary era of Big Data witnesses extensive use of stream processing technologies [1], [2], [3], [4]. In 2021, active endpoints reached 12.3 billion, reflecting a global 9% increase in connected IoT devices [5]. Notably, low latency and real-time response are two of the most prominent features of stream processing, enabling the analysis and querying of vast, continuously incoming data streams, including sensor data [6] and financial transactions [7]. Nevertheless, stream processing systems grapple with the challenge of escalating data volumes as the scale of data streams continues to grow [8]. On one hand, the network bears significant strain from the sheer volume of stream data, hindering real-time functionality in the presence of noticeable transmission delays. On the other hand, a high rate of data arrival can overload server memory, as stream processing systems temporarily store data in memory. Thus, it becomes imperative to explore innovative approaches aimed at alleviating the memory and bandwidth pressures confronting stream processing systems. Data compression, a conventional technique for minimizing file sizes [9], [10], [11], [12], [13], can enhance the efficiency of stream systems and contribute to a reduction in storage requirements when applied in stream processing scenarios.

The utilization of compression in stream processing is pivotal as it enhances the efficiency of stream systems, offering potentially three key advantages. First, stream processing often involves a substantial volume of continuous input data with comparable features, such as timestamps [14], [15], transaction amounts [7], and sensor values [6]. Notably, up to 30% of the data may be duplicated [16]. Through data compression, the redundancy in data can be effectively minimized due to the similarity of input streams, thereby reducing the volume of stream data. Second, in stream processing scenarios, the overhead from memory access and network transfer between nodes surpasses that of computation [17]. Our experiments reveal that transmission can consume up to 70% of the time with a 500 Mbps network. Consequently, it is evident that data compression significantly enhances the efficiency of stream systems. Third, the proven utility of direct computing on compressed data extends to data science applications [18], [19], [20], [21], [22], demonstrating its widespread performance benefits.

However, constructing compressed stream direct processing systems faces three major challenges. First, low latency is crucial for stream processing systems, but the encoding time required by compression methods often introduces significant delays. Experiments in Section II-B reveal that using Gzip may account for up to 90.5% of the overall stream processing time for encoding, an unacceptable overhead. Second, the processing queries and input data in stream processing scenarios are dynamic and subject to modification based on user needs. Some compression algorithms exhibit lower time overheads for compression and decompression, while others offer higher compression ratios.
4532 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 9, SEPTEMBER 2024
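To make the speed-versus-ratio trade-off concrete, the following sketch (not from the paper; Python's zlib stands in for Gzip's DEFLATE codec, and the batch contents are synthetic) measures both quantities at different compression levels:

```python
import time
import zlib

def measure(data: bytes, level: int):
    """Compress `data` at the given zlib level; return (ratio, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed

# Synthetic "stream batch": repetitive sensor-like readings.
batch = b"".join(b"%d," % (20 + i % 5) for i in range(100_000))

for level in (1, 6, 9):  # fast, default, best-ratio
    ratio, seconds = measure(batch, level)
    print(f"level={level} ratio={ratio:.1f}x time={seconds*1000:.1f} ms")
```

On repetitive stream data, higher levels typically buy a better ratio at the cost of longer encoding time, which is exactly the tension the adaptive selector must resolve.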
To achieve optimal performance, compression must be adaptive to different input workloads, necessitating a careful consideration of the advantages and disadvantages of each algorithm. Third, decompressing data before executing SQL queries can be time-consuming. In our experiments, the time overhead of decompression compared to query execution ranges from 2.09× to 31.37×. This introduces a potential performance impact due to the additional time and space requirements for decompression.

We introduce CompressStreamDB, a compression-based stream processing engine designed to overcome the three challenges. First, addressing the need for lightweight and fast compression algorithms to meet low-latency and real-time requirements in stream processing, CompressStreamDB integrates nine lightweight compression methods to enhance efficiency across various input streams. Moreover, deploying the stream processing system on edge devices brings the processor closer to data sources, facilitating accelerated data processing. Second, we introduce a fine-grained adaptive compression algorithm selector capable of dynamically choosing the compression algorithm that provides optimal performance benefits for input streams with varying features. Our system incorporates a cost model that guides the selector's decisions as the workload shifts, considering properties such as the value range and degree of repetition in the input data. This model estimates the time consumed by each compression algorithm, enabling the selector to choose the most efficient one. Third, we propose a method enabling direct querying of compressed data if the data are aligned in memory, thereby avoiding decompression costs. This approach applies query operations to compressed data with minimal modification, if the data maintain their structure after compression. Additionally, we view lightweight decompression-required techniques as a specific case, integrable into CompressStreamDB. Our preliminary work has been presented in [23]. In this paper, we add a new platform, a new dataset, a new compression algorithm, and new evaluations. Specifically, the new idea of applying edge devices is valuable compared to the cloud. Edge devices show potential in stream processing because they have lower costs and can be deployed close to data sources. We analyze the cost and power benefits of edge devices in detail.

We conducted experiments in both cloud and edge environments, employing four widely-used datasets with varying properties. The cloud platform utilized an Intel Xeon Platinum 8269CY 2.5 GHz CPU, while the edge platform employed a Raspberry Pi 4B. Our experimental results demonstrate that CompressStreamDB outperforms the state-of-the-art stream processing approach, achieving maximum system efficiency. CompressStreamDB exhibits a throughput increase of 3.84× and an average latency reduction of 68.0%. In terms of space savings, CompressStreamDB reduces data storage needs by 68.7%. Furthermore, the edge platform exhibits a throughput/price ratio that is 9.95× higher than the cloud platform, while its throughput/power ratio is 7.32× higher than that of the cloud platform.

Overall, we make the following three major contributions.
• We develop a compressed stream processing engine featuring diverse lightweight compression methods applicable across various scenarios.
• We introduce a system cost model to guide compressed stream processing and design an adaptive compression algorithm selector based on this model.
• We devise a processing approach that directly executes SQL queries on compressed streams and conduct comprehensive experiments to validate its effectiveness.

II. BACKGROUND

A. Stream Processing and Streaming SQL

Stream processing: Stream processing, a term in data science, focuses on the real-time processing of continuous streams of data, events, and messages. It encompasses various systems, including reactive systems, dataflow systems, and specific classes of real-time systems [31]. The query is the SQL statement used for data processing, which can be further subdivided into different operators. In the stream processing context, a stream comprises a sequence of tuples, where each tuple represents an event with elements like timestamp, amounts, and values. Tuples collectively form batches, which represent processing blocks containing a specific number of tuples. Within a batch, we use the term column to denote elements of different tuples in the same field. Stream processing finds extensive applications in scenarios requiring minimal latency, real-time response with minimal overhead (e.g., risk management [32] and credit fraud detection [14]), and predictable and approximate results (e.g., SQL queries on data streams [15] and click stream analytics [33]).

Streaming SQL: Among various fields of stream processing, Streaming SQL is one of the emerging hot research topics. Streaming SQL can be perceived as the streaming version of SQL, processing streams of data instead of the database. Traditional SQL queries process the complete set of available data in the database and generate definite results. In contrast, streaming SQL needs to continuously process the arriving data, and the result is non-deterministic and constantly changing. As a result, this can raise a number of issues, such as how to reduce the response time. Streaming SQL has a declarative nature similar to SQL and provides an effective stream processing technology, which largely saves time and elevates productivity in stream data analysis. Besides, many stream systems have been proposed, such as Apache Storm [34] and Apache Flink [35], whose relational API is suitable for stream analysis, providing a solid development foundation and productive tools.

B. Compression Algorithms

Various compression algorithms have been proposed, but to ensure accurate query results, our system exclusively considers lossless compression algorithms. Lossless compression algorithms can be categorized into heavyweight and lightweight compression. Noteworthy heavyweight compression algorithms, such as Lempel-Ziv algorithms [10], [12] and Huffman encoding [9], [36], offer high compression ratios but involve complex encoding and decoding processes, causing significant time overhead. Given the real-time and low-latency
TABLE I
EAGER AND LAZY COMPRESSION METHODS IN LIGHTWEIGHT COMPRESSION
requirements of stream processing, which cannot tolerate prolonged delays, our exploration of heavyweight compression algorithms revealed that while they may achieve higher compression ratios, they also result in longer (de)compression times, offering limited improvement in the performance and stability of the stream system. In preliminary experiments, we utilized the commonly used compression tool Gzip in stream processing systems. However, the system with Gzip spent 90.5% of the total time in compression and less than 10% in transmission. Despite its high compression ratio and low transmission time, the compression time overhead could lead to system delays or even pauses. Hence, we advocate the use of lightweight compression algorithms to expedite stream processing.

Lightweight compression: Lightweight compression represents a trade-off between compression ratio and time, employing relatively simple encoding methods. In contrast to heavyweight compression algorithms, lightweight alternatives sacrifice some compression ratio for faster (de)compression times. We examined a range of works [11], [13], [24], [25], [26], [27], [28], [29], [30] on lightweight compression algorithms, covering the most commonly used ones. Each algorithm has its unique advantages and disadvantages, which are appropriate to data streams with different characteristics. For instance, Elias Gamma encoding and Elias Delta encoding [11] are suitable for small and large numbers, respectively. Run Length Encoding [26] is effective for data with more repetition. The effectiveness of Null Suppression [13] depends on redundant leading zeros in the elements. Bitmap and its extensions [28], [29], [30] are suitable for compressing data with few distinct values.

Eager and lazy compression: We categorize these lightweight compression algorithms into two groups: eager compression and lazy compression [37]. In Table I, we provide a summary of nine common lightweight compression algorithms with the two categories. Eager compression algorithms compress subsets of input tuples as soon as they arrive, allowing them to process each tuple without waiting. On the other hand, lazy compression algorithms wait for the entire data batch before compression. The advantage of eager algorithms lies in their ability to process each tuple in real-time, while the advantage of lazy algorithms is their capacity to leverage the similar redundancy in large datasets, achieving a higher compression ratio.

C. Edge Computing

Compared to traditional cloud computing architecture, edge computing pushes computing resources and services to the Internet of Things, end devices, and user terminals to achieve real-time data processing and response, reducing the pressure on bandwidth, storage, and computing resources caused by centralized computing. It meets the requirements of low latency, high bandwidth, and data security. With the rapid development of the Internet of Things, cloud computing, Big Data, and other relevant technologies, edge computing has been widely used in fields such as smart homes [38], smart cities [39], the industrial Internet [40], and intelligent transportation [41].

The lightweight development of edge devices is the current and future trend. With the surge of the mobile Internet, edge computing has extended beyond personal computers and servers to encompass mobile devices, including edge computing platforms based on mobile phones [42]. Following the rapid progress of Internet of Things (IoT) technology, edge computing has expanded into the realm of low-power and embedded devices, such as Raspberry Pi 4B [43] and microcontrollers [44]. These devices, characterized by smaller size, lower power consumption, and versatility to run in various environments, support multiple communication protocols and data processing algorithms. They can perform tasks like anomaly detection [45], exoskeletons [46], voice activation [47], object detection [48], and more. The edge device market is anticipated to grow, driven by the increasing demand for real-time data processing and analysis, as well as the need for low-latency, high-bandwidth, and secure data transmission. According to IDC's forecast, the global number of connected devices is expected to surpass 8 billion by 2025, with approximately 40% of these devices situated at the edge [49].

In stream data processing, edge computing can handle data at the point of generation, alleviating the burden of data transmission and storage. This enables faster real-time analysis of data. For instance, it finds applications in real-time detection systems based on sensors [50], video stream analysis [51], logistics tracking systems [52], network security detection [53], and data processing for autonomous vehicles [48]. Edge devices can efficiently compress and process stream data, meeting the real-time and accuracy requirements of practical tasks.

III. MOTIVATION

A. Problem Definition and Basic Idea

Problem definition: We show the problem definition of processing compressed streams as follows. The input data streams are unbounded sequences of tuples, which are generated from the data source. The data block to be processed in the stream
highway transmits its location through sensors, which are utilized to calculate tolls based on the specific road section. Lower tolls incentivize the use of less congested roads. Our solution enables the system to efficiently process substantial volumes of streaming vehicle location data, facilitating more effective decision-making in toll adjustments.
• Cluster management [57] can monitor the execution of computation tasks. The incoming data relate to the status of the cluster, including task submission, state of success or failure, etc. Anomaly detection alerts for unexpected failures should be emitted as soon as possible. Our solution can provide a more rapid response for anomaly detection.

Various other real-time stream applications, including manufacturing equipment detection [58], ship behavior prediction [59], and temporal event sequence detection [60], necessitate efficient stream processing. Fig. 2 illustrates the breakdown of time utilization in these applications. The complete bar denotes the overall duration of uncompressed stream processing, with the white segment representing the portion of time consumed by network transmission. Notably, with a 500 Mbps bandwidth network, network transmission occupies over 70% of the total time. Even on a 1 Gbps network, transmission still accounts for about 50% of the total time. This highlights the bottleneck created by transmission time in stream applications, underscoring the critical need for the advantages offered by compressed stream direct processing.

IV. COMPRESSSTREAMDB FRAMEWORK

We propose a fine-grained compressed stream processing framework, called CompressStreamDB, and we show our system design in this section.

A. Overview

CompressStreamDB addresses the challenges mentioned in Section I, effectively mitigating time and space overhead in stream processing. It dynamically selects compression algorithms and seamlessly integrates them into stream processing.

Structure: The CompressStreamDB framework comprises two core components: the client and the server, depicted in Fig. 3. The client has a compression algorithm selector based on the cost model. This selector is tasked with data collection and optimal compression algorithm selection. Note that the term "client" refers to devices seeking compressed stream processing in collaboration with the server, encompassing data sources like sensors or smartphones, or intermediate nodes handling data. Leveraging lightweight compression algorithms, even resource-constrained devices like data sources can perform compression. Consequently, our system accommodates a multi-layer architecture with multiple compression client layers, while the client-server setup represents a simplified model. Compression functionalities are deployed on the client side. In a distributed architecture, individual clients perform independent compression without coordination. In scenarios where a single query's input data originates from multiple clients, each client autonomously determines its compression strategy based on the specific data characteristics. The server manages the processing of queries on compressed stream data, housing the kernel functions necessary for executing these queries. Note that while CompressStreamDB is primarily designed for direct processing of compressed stream data, it does not dismiss the inclusion of efficient compression algorithms that require decompression; they can also be integrated into the system.

Scenario: In a streaming scenario, CompressStreamDB dynamically selects compression algorithms and conducts fine-grained compression-based stream processing based on specified parameters like network throughput and performance metrics of clients and servers. The compression algorithm selection aims to optimize the system's overall performance, specifically to minimize the total processing time.

Workflow: After the data are generated in the client of CompressStreamDB, the data mainly undergo a series of processes including compression, transmission, decompression, and query, which is also the basis of our proposed system cost model. Prior to compression, the selector preloads the data and identifies the compression algorithm that ensures optimal performance. This decision-making process relies on our comprehensive cost model, considering various factors from machine metrics and network conditions to the effectiveness and cost of compression algorithms (refer to Section IV-C for details). Our system operates at batch granularity, employing distinct compression algorithms for each data column, as discussed in Section II. Subsequently, the compressed data are transmitted to the server, where they are processed alongside the corresponding SQL queries.

Batch: In CompressStreamDB, stream data are processed at batch granularity. The batch size operates independently from
the window size in streaming SQL. The window size pertains to a range concept within SQL, whereas the batch size represents the processing granularity of the query engine [1], [14], [15]. It is worth noting that a batch can be smaller than a window or encompass multiple windows. The batch size setting plays a dual role, since growing the batch size can increase both the latency and the compression ratio. We determine the batch size using dynamic sampling, where its overhead can be amortized during stream processing. Users can specify and adjust the batch size based on actual requirements. Experimental insights can be found in Section VII.

Flexibility: CompressStreamDB stands as a highly flexible system, facilitating not only the support for nine existing data compression methods but also the seamless integration of additional compression algorithms. This flexibility is designed to effectively address the increasing demand for stream data processing. The flexibility of CompressStreamDB empowers it to adeptly handle and analyze diverse data streams, varying in types, scales, and rates. This capability allows the system to better align with the demands of real-world tasks.

Portability: The client of CompressStreamDB is highly portable, readily adaptable to diverse devices, including embedded edge devices like Raspberry Pi 4B, requiring minimal modifications. This versatility stems from the lightweight and high-speed algorithms implemented in the client, demanding minimal computational power. The server of CompressStreamDB is portable and can be deployed to diverse high-performance devices. Its direct SQL operators are universally designed and can be adapted to different compression algorithms and platforms. The portability of CompressStreamDB empowers its adaptability across diverse devices and platforms, rendering it well-suited for resource-constrained environments, notably in edge computing scenarios.

B. Compressed Stream Processing

Adaptive processing for dynamic workload: In CompressStreamDB, we dynamically process the input data stream using our selector. As detailed in Section II, stream data processing is achieved through SQL queries, treating a batch as the minimum processing unit. The system predominantly employs common relational operators including projection, selection, aggregation, group-by, and join. Stream processing is performed through query statements composed of these operators with a given size of the sliding window. After a preset number of batches, the system dynamically reselects compression algorithms for data columns using the system cost model. CompressStreamDB then scans the next five batches to predict the data properties of the follow-up stream, uses the system cost model to calculate latency with the properties, and finally identifies the new processing method with the lowest total processing time. Considering that the compression algorithms we use are all lightweight, the overhead of dynamic reselection can be negligible. The batch size and window size in our system are independent of each other. Changes in compression do not directly affect the windows. For smaller windows within a batch, reselecting compression affects multiple windows. However, for larger windows that span multiple batches, reselection impacts only subsequent batches, without requiring recompression of previously compressed batches.

Supported data types: CompressStreamDB not only incorporates lightweight compression algorithms for integers, but also supports operations on floating-point numbers and strings. Floating-point items can be converted into integers by multiplying by a factor of 10^n [13]. The n here denotes the maximum number of decimal places within the data column. For instance, in the context of measured values in smart grids [54], the values include numbers such as {3.216, 11.721, 9.8}. With a maximum of 3 decimal places, all data can be scaled by a factor of 10^3, resulting in {3216, 11721, 9800}. Given that data columns typically exhibit closely aligned decimal places, overflow is uncommon in most scenarios. Overflow risks arise only when the converted integer exceeds 2^31 (the limit of a 32-bit integer). In such cases, we recommend either utilizing a 64-bit integer representation or exploring the option of employing the dictionary encoding method. For strings, they can be mapped to integers using dictionary encoding, which is a widely-used method with marginal overhead [61], [62], [63]. Our evaluation in Section VII covers data types of integer, floating-point, and string, all of which are encoded as integers before loading. After unified data encoding, different types of data can be processed in CompressStreamDB.

Query without decompression: Decompression is employed to restore the original data. CompressStreamDB avoids decompression as much as possible, thus reducing time and memory access, and accelerating the query process. In our design, we can directly query the compressed data when the compressed stream meets the following three conditions. First, the compressed data are similar to the data before compression, and are still structured. Second, the compressed stream data should be aligned. Third, the compression does not affect the order of the stream or the process of kernel operation. Our SQL operators are specially designed for compressed data processing. These operators can accept parameters of the number of bytes each compressed column occupies. For instance, if the original column holds 4 bytes per element but is compressed to 1 byte per element, our operators handle this column by reading and writing only 1 byte for each entry. Despite various compression algorithms encoding raw data differently, their results ultimately conform to a fixed format. As long as the compressed format meets these three conditions, direct processing of the compressed data is supported. This universal design avoids the complexity of developing separate operator kernels for different compression methods. Our implementation is portable to diverse devices because it does not require any special hardware support.

Example: Assume that the stream data include three columns: col1 is 8 bytes, col2 is 4 bytes, and col3 is 4 bytes. After compression, col1 is 2 bytes, col2 is 1 byte, and col3 is 1 byte. A query like "select col1, avg(col2) from data group by col3" can be mapped to "select col1', avg(col2') from data group by col3'". In this way, we only need to update the number of bytes to be read for each corresponding column in the operator. The original stream processing operators are mapped to the
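The byte-width remapping in the running example ("select col1, avg(col2) from data group by col3" over columns compressed from 8/4/4 bytes to 2/1/1 bytes) can be illustrated as follows. This is a hypothetical Python reconstruction using the `struct` module; the system's actual operator kernels are not shown in this excerpt:

```python
import struct
from collections import defaultdict

# Compressed layout per tuple, using the widths from the running example:
# col1 -> 2 bytes, col2 -> 1 byte, col3 -> 1 byte (originally 8 + 4 + 4).
FMT = "<HBB"                        # little-endian: uint16, uint8, uint8
TUPLE_SIZE = struct.calcsize(FMT)   # 4 bytes instead of the original 16

def avg_col2_by_col3(batch: bytes):
    """Evaluate `select avg(col2) from data group by col3` directly on the
    compressed batch, reading only the compressed byte widths per column."""
    sums, counts = defaultdict(int), defaultdict(int)
    for off in range(0, len(batch), TUPLE_SIZE):
        _, col2, col3 = struct.unpack_from(FMT, batch, off)
        sums[col3] += col2
        counts[col3] += 1
    return {g: sums[g] / counts[g] for g in sums}

# Three compressed tuples of (col1, col2, col3).
batch = b"".join(struct.pack(FMT, *t)
                 for t in [(500, 10, 1), (501, 20, 1), (502, 7, 2)])
print(avg_col2_by_col3(batch))  # {1: 15.0, 2: 7.0}
```

The operator logic is unchanged; only the number of bytes read per column differs, which is the essence of querying without decompression when the compressed layout stays fixed-width and aligned.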
TABLE II
SYMBOLS AND MEANINGS

In the following part of this section, we delve into the considerations of the cost model, including machine metrics, network conditions, and the efficiency of compression techniques. We focus on modeling the time consumption across these four aspects.

1) Compression time: For the processing batch, Size_T represents the number of bytes per tuple, while Size_B means the number of tuples per batch; thus there are Size_T · Size_B bytes for each batch. For a chosen compression algorithm τ, T_{memory}^{com,τ} denotes the number of instructions used for memory accesses during compression, while T_{operation}^{com,τ} represents the number of instructions used for computation. Then t_{compress} can be defined by (2).

$$ t_{compress} = \alpha \cdot t_{wait} + \max\left( \frac{T_{memory}^{com,\tau} + T_{operation}^{com,\tau}}{N_{client}},\; \frac{Size_T \cdot Size_B}{B_{client}} \right). \tag{2} $$

3) Decompression time: Considering the importance of future-proof compression algorithms that require decompression pre-processing, our system retains the flexibility to incorporate these methods. For a chosen compression algorithm τ, T_{memory}^{decom,τ} symbolizes the number of instructions executed for memory access during decompression, while T_{operation}^{decom,τ} denotes the count of computational instructions. This allows us to define t_{decompress} as per (6).
first encoding [11]. Given a positive number x, let L = ⌊log₂ x⌋ and N = ⌊log₂(L + 1)⌋. The Elias Delta encoding form of x has 1) N zero bits, followed by 2) the (N + 1)-bit binary representation of L + 1, and 3) the last L bits of the binary form of x. Hence, it uses 2N + L + 1 bits to represent the number x.

As with Elias Gamma encoding, we extend Elias Delta encoding to make it suitable for non-negative integers. In addition, we have also aligned its encoding results. For a given data column, we use EDDomain to represent the maximum number of bytes required for Elias Delta encoding of the elements of this column. Then, the data in this column all occupy EDDomain bytes during processing. It requires more bits for compressing small values compared to Elias Gamma encoding, but performs better with larger values. For larger integers, the length of Elias Delta encoding approaches entropy, making it nearly optimal. The compression ratio r for Elias Delta encoding can be expressed using (11):

$r = \frac{Size_C}{EDDomain}$.  (11)

Its parameters are: β = 0, r = Size_C / EDDomain.
• Pros: 1) It is more stable than Elias Gamma encoding. 2) It can handle a larger range of values.
• Cons: 1) Its compression process is more complicated, so the compression speed can be slower. 2) When the value is very small, its performance is not as good as Elias Gamma encoding.

Null suppression with fixed length (NS): The null suppression with fixed length method removes leading zeros from the binary representation of the element, efficiently eliminating the redundancy caused by the data type [24]. “With fixed length” means that the elements of the compressed data have the same number of bits.

To estimate the compression effects of null suppression with fixed length and null suppression with variable length, we introduced the ValueDomain array. The size of this array is the same as the batch size. It records the number of bytes required to represent the valid bits of each element in the column. ValueDomain_MAX denotes the bytes used by elements after null suppression with fixed length. The compression ratio r for this method is given by (12):

$r = \frac{Size_C}{ValueDomain_{MAX}}$.  (12)

Similarly, its parameters are: β = 0, r = Size_C / ValueDomain_MAX.
• Pros: Its compression is very convenient and can be performed efficiently.
• Cons: It is not good at handling large outliers.

Null suppression with variable length (NSV): Null suppression with variable length is similar to null suppression with fixed length, achieving compression by removing leading zeros [24]. Unlike the fixed-length method, it does not mandate a consistent number of bits in the compressed output. Instead, it records the byte length of each encoded value for decompression. In our design, with values ranging between 1 and 4 bytes, we use two bits to signify the byte count per value. Consequently, every group of four elements requires an extra byte to record their lengths.

Compared with null suppression with fixed length, null suppression with variable length has a better compression ratio in most cases. It has a good effect on elements of different sizes. However, when column elements predominantly fall within a narrow range, the additional bytes used to note their lengths can become an overhead that is difficult to overlook.

Utilizing the introduced ValueDomain array in NS, the total number of bytes needed after compression can be derived by summing the values within ValueDomain. The compression ratio r for null suppression with variable length is determined by (13):

$r = \frac{Size_C \cdot Size_B}{Size_B/4 + \sum_{i=1}^{Size_B} ValueDomain_i}$.  (13)

Because the compressed elements are not byte-aligned, they have to be decompressed before processing. We have its parameters: β = 1, r = 1.
• Pros: 1) It can make better use of space and achieve a higher compression ratio compared to NS. 2) It can handle the situation of large data changes and is not easily affected by outliers.
• Cons: 1) It needs decompression before processing. 2) It needs extra bytes to record the length.

C. Lazy Compression

Lazy compression algorithms wait until the entire input batch arrives, and then compress the whole batch. Hence, their α value in the system cost model is 1.

Base-Delta encoding (BD): Base-Delta encoding is ideal for scenarios with large values and a limited data range, or when differences between values are considerably smaller than the values themselves [25]. This method selects a base value from a series of values and stores it. Each element is represented by its delta value from this base. If the delta value is significantly smaller than the original element, it can be efficiently represented with fewer bytes.

We designate BDDomain to represent the maximum bytes needed for Base-Delta encoding in a data column. The compression ratio r for Base-Delta encoding can be calculated using (14):

$r = \frac{Size_C}{BDDomain}$.  (14)

It can avoid decompression in CompressStreamDB. Then, we have its parameters: β = 0, r = Size_C / BDDomain.
• Pros: It achieves fast compression due to its reliance on basic vector addition and subtraction operations.
• Cons: It is only suitable for data with a small range of variation.

Run length encoding (RLE): Run Length Encoding is a classical compression algorithm [26] effective for datasets featuring
recurring sequences of elements. It efficiently reduces space by compressing repetitive data that occurs periodically.

Suppose the average run length of a column of a data batch is represented by AverageRunLength. As run-length encoding requires an additional integer variable (4 bytes) to represent the run length, the compression ratio r for run-length encoding is defined in (15):

$r = \frac{Size_C \cdot AverageRunLength}{Size_C + 4}$.  (15)

RLE does not maintain byte alignment and disrupts the original data structure, necessitating decompression before processing. Consequently, its parameters are defined as follows: β = 1, r = 1.
• Pros: 1) Its compression speed is relatively fast. 2) It can achieve a high compression ratio for data if AverageRunLength is high.
• Cons: 1) It needs decompression before processing the data. 2) It only applies to continuously repeated data.

Dictionary (DICT): The dictionary compression algorithm is commonly used to convert larger data into smaller data by establishing a one-to-one relationship [27]. If the number of data types is denoted as Kindnum, the compression ratio r of dictionary encoding can be defined using (16):

$r = \frac{Size_C}{\lceil \log_2 Kindnum \rceil / 8}$.  (16)

It is byte-aligned and structured, so it can avoid decompression. Accordingly, its parameters are: β = 0, r = Size_C / (⌈log₂ Kindnum⌉ / 8).
• Pros: It has a relatively high compression ratio.
• Cons: It is appropriate for use when there are only a few types of data.

Bitmap: The bitmap compression algorithm is relatively concise, using a bit string to represent the original data [13], [28], [29], [30]. Each bit in the bit string corresponds to a unique element in the original data. If the number of data types is denoted as Kindnum, the compression ratio r of bitmap encoding can be defined using (17):

$r = \frac{Size_C}{2^{\lceil \log_2 Kindnum \rceil} / 8}$.  (17)

It destroys the data structure of the original data, so: β = 1, r = 1.
• Pros: It has fast compression and decompression speed.
• Cons: It is appropriate for use when there are only a few types of data.

Position list word aligned hybrid (PLWAH): To demonstrate the flexibility of CompressStreamDB in compression algorithms, besides the original compression algorithms, we further extend a highly efficient variant of compressed Bitmap, PLWAH, into our system. The evaluation indicates that PLWAH further improves the performance of our system, as detailed in Section VII.

PLWAH is an efficient compressed bitmap data structure with a high compression ratio and fast operation capability [28]. When a sequence of elements filled with 1s or 0s appears, the PLWAH algorithm compresses them into a single element, thereby achieving a higher compression ratio. Moreover, the first different value after this sequence will also be compressed into the same element. In other words, PLWAH can merge all elements filled with 0 or 1.

Example: We use a simplified 8-bit example to illustrate the compression scheme of PLWAH. Assume that we have 4 bitmap entries to compress and the original data is [00000000, 00000000, 00000000, 00100000]. Then the compressed data is an 8-bit entry [1 0 011 011]. The first 1 indicates that this word is a compressed fill word. The second 0 indicates that the fill word is filled with 0. The next three digits “011” indicate a literal word with “1” in the third digit (00100000). The last three digits “011” indicate three consecutive 0 fill words. In this way, the original four items are compressed into one item. In the context of a 32-bit representation of PLWAH, the first 2 bits signify the word type, the subsequent 5 bits denote the position of the next literal word’s “1”, and the final 25 bits indicate the number of merged fill words [28]. Note that our bitmap approach is designed to compress data with specific types of values. The compressed bitmap can only have at most one “1”, with the remaining bits set to “0” [13]. PLWAH can be applied in such scenarios.

In practice, the most frequently occurring element can be mapped to the element filled with 0 in the bitmap, and the second most frequently occurring element can be mapped to the element filled with 1 in the bitmap. This strategic mapping significantly reduces space allocation for these two elements in the entire column. If we denote the count of the most frequent element as MostCount, and the count of the second most frequent element as SecondCount, then the compression ratio r of PLWAH is formally defined in (18):

$r = \frac{Size_B}{Size_B - MostCount - SecondCount}$.  (18)

It destroys the data structure of the original data, so: β = 1, r = 1.
• Pros: It has a much higher compression ratio than bitmap.
• Cons: It is appropriate for use when there are only a few types of data.

VI. IMPLEMENTATION

We implement CompressStreamDB with references to [14], [15], [65]. It comprises two primary modules: a client module housing the stream processing compression algorithms and the adaptive selector, and a server module equipped with fundamental SQL operators to process compressed streams. These operators include selection, projection, groupby, aggregation, and join. The server module takes charge of handling these compressed streams efficiently. Additionally, we integrate a profiler into the server, facilitating the collection of key performance metrics, including (de)compression and transmission times. It is important to note that the compression functionality within CompressStreamDB can be turned off, allowing support for processing uncompressed streams. In scenarios involving small-window queries, like handling a single tuple, CompressStreamDB can seamlessly execute uncompressed stream
processing without waiting for the entire data batch. This hybrid processing mode significantly extends the applicability of the system across a broad spectrum of stream processing scenarios.

In our batch implementation, a sliding window can extend across multiple batches. To address this challenge, our system incorporates a batch buffer, temporarily storing data from the previous batch. Upon detecting a sliding window that spans across batches, the system awaits the arrival of the subsequent batch. At this point, it retrieves the previously cached batch from the buffer, facilitating the computation of results across sliding windows that cross batch boundaries.

CompressStreamDB can be deployed among different environments. For example, Apache Storm supports custom serializers. We can wrap the compression module of CompressStreamDB into a custom serializer, and then embed it into Storm for use. However, this incurs additional challenges such as model integration and Storm internal implementation overhead, especially in distributed environments. Because our work focuses on the adaptive selection of compressions in stream processing, we leave the adaptation to other systems as future work.

VII. EVALUATION

A. Experimental Setup

Methodology: The baseline for comparison is CompressStreamDB without compression. Our system offers the ability to disable the compression function, allowing uncompressed stream processing. The comparison against this baseline aims to evaluate whether our solution enhances performance in stream systems. To better demonstrate the benefits of our adaptive compression approach, we conducted a performance comparison between our implementation and two high-performance stream processing systems: Saber [14] and FineStream [15]. Notably, both Saber and FineStream operate without employing compression techniques. We implement the Base-Delta encoding compression with reference to TerseCades [66], denoted as “Base-Delta”. TerseCades stands as a pioneering exploration of stream processing with data compression, demonstrating its efficacy in this domain. Our work showcases progressive advancements in performance compared to the Base-Delta encoding employed in TerseCades. As TerseCades is not open-source, we re-implement its functionalities. Comparing the results of our implementation with those presented in the TerseCades paper [66], we observe similar outcomes. For instance, in [66], the system processed queries on the Pingmesh Data [67], achieving a throughput of 37.5 MElems/s. In our implementation using Base-Delta encoding, we accomplish a throughput of 37.2 MElems/s when querying the Smart Grid Data [54], showing a comparable performance level. Our study extends further by comparing the adaptive compression stream processing capability of CompressStreamDB across nine lightweight compression algorithms. To exhibit the portability of CompressStreamDB, we conduct experiments on both cloud and edge platforms, analyzing and comparing their throughput, power efficiency, and cost efficiency.

Platforms: Our client is equipped with an Intel Xeon Platinum 8269CY 2.5 GHz CPU and 16 GB memory, running Ubuntu 20.04.3 LTS with Java 8. Our server is deployed on both a cloud platform and an edge platform. Their information is as follows.
• Cloud platform. It has the same configuration as the client. The price of the CPU is about $899. Its TDP is 205 W, which indicates that it cannot be used at the edge because of high power consumption. We mainly conduct system performance experiments on this platform, and use the turbostat tool to monitor its power.
• Edge platform. Our edge device is a Raspberry Pi 4 Model B [68]. It is equipped with a quad-core ARM Cortex-A72 64-bit SoC and 8 GB memory, running Raspberry Pi OS 5.10 with Java 8. Its price is $75, with a maximum 6.4 W power consumption [69]. We use the edge device to prove the portability of our system, and to show the advantages of using edge devices for stream processing. To detect the Raspberry Pi’s power consumption, a power meter is attached.

Datasets: Our evaluation incorporates four datasets widely used in previous studies [14], [15], [61], [70], [71], [72], [73], [74], [75], all of which remain relevant in current discussions. For instance, the smart home market is projected to reach $51.23 billion by 2026, growing at an annual rate of 11.7% [55], emphasizing the continued significance of these datasets. The first dataset originates from energy consumption measurements in smart grids [54], capturing data from various devices within a smart grid to enable load predictions and real-time demand management in energy consumption. The second dataset, compute cluster monitoring [57], is derived from a Google cluster, simulating a cluster management scenario. The third dataset, the linear road benchmark [56], records vehicle position events and models a network of toll roads. The fourth dataset is the Star Schema Benchmark (SSB) [76]. It contains one fact table, four dimension tables, and thirteen standard queries. We adjust SSB for stream processing. The adaptation to more benchmarks is shown in Section VII-E.

Queries: We utilize eight queries to evaluate the performance of adaptive compression in CompressStreamDB. For each dataset, we execute two queries to derive performance metrics, evaluating various processing methods including the baseline, nine lightweight compression algorithms, and CompressStreamDB. These queries are well-established in prior stream processing studies [14], [15], [72], [73], [74], [75]. The specific details of the eight queries are outlined in Table III. Q1 and Q2 analyze anomaly detection in the smart grids dataset. Q3 and Q4 operate on the linear road benchmark dataset. Queries Q5 and Q6 interact with the Google compute cluster monitoring data. Q7 and Q8 tackle the Star Schema Benchmark; to adapt to stream processing, we rewrite Q1.1 and Q1.2 of SSB.

In the Smart Grid and Linear Road Benchmark datasets, a batch encompasses 100 windows, and each window contains 1024 tuples. In the case of the Cluster Monitoring dataset, a batch comprises 200 windows, with each window consisting of 512 tuples. Finally, within the Star Schema Benchmark dataset, each batch contains 100 windows, and each window encompasses 512 tuples. The performance result for each dataset is the average of the results of related queries.
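The batch-buffer mechanism for sliding windows that cross batch boundaries, described in the implementation above, can be sketched as follows. This is a minimal illustration with our own class and method names, not CompressStreamDB's actual API: the tail of each batch that still belongs to an unfinished window is cached and prepended to the next batch.

```python
class BatchBuffer:
    """Cache the suffix of the previous batch so sliding windows that
    cross a batch boundary can be evaluated once the next batch arrives
    (a sketch; names are illustrative, not the system's actual API)."""

    def __init__(self, window_size, slide):
        self.window_size = window_size
        self.slide = slide
        self.prev = []  # tuples carried over from the previous batch

    def windows(self, batch):
        """Yield every complete sliding window over prev + batch, then
        retain the suffix whose windows need tuples from the next batch."""
        data = self.prev + batch
        start = 0
        while start + self.window_size <= len(data):
            yield data[start:start + self.window_size]
            start += self.slide
        self.prev = data[start:]
```

For example, with `window_size=4` and `slide=2`, a window such as `[5, 6, 7, 8]` that straddles two consecutive batches is produced as soon as the second batch arrives.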
TABLE III
QUERIES USED IN EVALUATION
Fig. 5. Latency of different compression methods.
Fig. 7. Time breakdown of compression and decompression. CmpStr is short for CompressStreamDB.
TABLE IV
RELATIONS BETWEEN TIME AND COMPRESSION
Fig. 10. Comparison of throughput/price ratio on cloud and edge device.
Fig. 12. Accuracy of the cost model. CmpStr is short for CompressStreamDB.

throughput to price (USD), and the throughput/power ratio, denoting the ratio of throughput to power (Watt).

Throughput/price ratio: Fig. 10 shows the throughput/price ratios on the cloud server and edge device. In the Smart Grid dataset, the throughput/price ratio of the edge CPU surpasses that of the cloud CPU by 10.81×, and for the remaining three datasets, the ratios stand at 8.90×, 10.91×, and 9.18× respectively. On average, the edge device exhibits a 9.95× higher throughput/price ratio compared to the cloud, showcasing a pronounced cost advantage of utilizing edge devices. These findings affirm the cost-effectiveness inherent in the design purpose of the Raspberry Pi, known for its economical utility.

Throughput/power ratio: In Fig. 11, the throughput/power ratio for both cloud and edge platforms is presented. Notably, the average power consumption of the cloud CPU stands at 33.5 W, whereas the edge device operates at an average power of 3.8 W. In the Smart Grid dataset, the throughput/power ratio on the edge surpasses that of the cloud by 7.95×, while for the other three datasets, these ratios are 6.55×, 8.03×, and 6.75× respectively. On average, the edge platform demonstrates a 7.32× higher throughput/power ratio compared to the cloud. These findings underscore the potential for significant energy savings by leveraging edge devices compared to cloud-based operations.

E. Design Tradeoffs and Discussion

Model accuracy: We verify the accuracy of our system cost model in this part. We use the example of the Smart Grid dataset for illustration, as shown in Fig. 12. The dashed line and the solid line show the estimated time and measured time respectively. All estimated values are slightly less than the actual values because of additional overhead caused by the system operation. On average, the system cost model within CompressStreamDB achieves an accuracy of 89.2%. This level of accuracy suggests that the cost model is reliable and suitable for estimating the cost associated with stream processing integrated with compression techniques.

Batch size: As outlined in Section IV-A, the batch size is a factor influencing both latency and compression ratio. Using the Smart Grid workload as an illustration, where each window contains 1,024 tuples, we depict their interrelation in Fig. 13, considering three distinct network settings: 1 Gbps network, 100 Mbps network, and a single node without network transmission. Our observations are as follows. First, at 100 Mbps, we observe a notable rise in latency corresponding to larger batch sizes. Conversely, within the 1 Gbps network and single-node mode, batch size exhibits comparatively minimal impact on system latency. This divergence arises from the constraints imposed by limited network bandwidth, leading to data queuing before transmission. Consequently, larger batch sizes can induce system pauses. Second, as batch size increases, the space occupancy decreases. This phenomenon is attributed to the improved utilization of data redundancy with larger batch sizes. Third, the absence of an optimal batch size suggests that its determination requires specific situational analysis. Furthermore, we conducted measurements on cross-batch sliding windows, varying the window slide size within the range {1, 128, 256, 512, 1024} across different network settings. We observed nearly
identical performance levels with minimal fluctuation (less than 2%). This consistency is attributed to the ability of our batch buffer to retain critical data without imposing additional strain on network transmission.

Performance on additional benchmark: CompressStreamDB can adapt to various benchmarks. We evaluate it on the TPC-H benchmark to further demonstrate its potential. The TPC-H benchmark contains one fact table and seven dimension tables, with twenty-two standard queries. We adjust TPC-H and rewrite its queries for stream processing. We use Q1, Q2, and Q3 of its standard queries for evaluation. On average, CompressStreamDB achieves 2.18× throughput compared to our baseline, while the best single compression algorithm, BD, achieves 1.61× throughput. In terms of latency, CompressStreamDB can reduce the latency by 53.2% compared to the baseline, and by 26.2% compared to BD.

Impact of parallelism: To investigate the impact of parallelism on performance, we conduct parallel throughput experiments on the cloud platform. On average, the parallel version achieves a throughput improvement of 2.10% compared to the single-core version. Although our cloud platform is equipped with an Intel Xeon Platinum 8269CY CPU and supports 52 threads, the performance improvement of parallelism is limited due to network transmission being a key factor affecting performance in our experimental network environment.

list compression in memory. Deliège and Pedersen [28] optimized space and performance for bitmap compression. Wang et al. [29] conducted an experimental study between bitmap and inverted list compressions. Fang et al. [13] analyzed common compression algorithms, while Przymus and Kaczmarski [64] explored how to select an optimal compression method for time series databases. Sprintz [91] introduces a four-part composite compression algorithm for time-series data. Many works [86], [87] used hardware such as GPUs and FPGAs to optimize data compression. As for processing directly on compressed data [18], [19], [20], [21], [92], [93], [94], this technology can provide efficient storage and retrieval of data. For example, Chen et al. [94] proposed a memory-efficient optimization approach for large graph analytics, which compresses the intermediate vertex information with Huffman coding or bitmap coding and queries on the partially decoded data or directly on the compressed data. Li et al. [95], [96], [97] presented compression methods for very large databases, with aggregation operating directly on compressed datasets. Succinct [20] enables efficient queries directly on a compressed representation of data. Other works [18], [19], [98], [99] focused on the direct processing of other compressed storage structures such as graphs. Different from these studies, our work is the first fine-grained stream processing engine that can query compressed streams without decompression.
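As a minimal illustration of this direct-processing idea (our own sketch, not the engine's code), a selection predicate can be evaluated on a dictionary-compressed column by encoding the query constant once, so the compressed stream itself is scanned and no element is ever decompressed:

```python
def build_dict(column):
    """Map each distinct value to a small integer code (DICT encoding)."""
    codes = {}
    for v in column:
        codes.setdefault(v, len(codes))
    return codes

def select_eq_compressed(compressed, codes, constant):
    """Evaluate `value == constant` directly on the compressed column:
    the constant is translated into code space once, then compared
    against the stored codes, so no element is decompressed."""
    target = codes.get(constant)  # None if the constant never occurs
    return [i for i, c in enumerate(compressed) if c == target]

column = ["on", "off", "on", "idle", "on"]
codes = build_dict(column)
compressed = [codes[v] for v in column]
hits = select_eq_compressed(compressed, codes, "on")  # tuple positions matching "on"
```

Because DICT is byte-aligned and order-preserving within a batch, equality selections like this one need only the code dictionary, which is exactly why such encodings can avoid the decompression step.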
[6] W. Wingerath et al., “Real-time stream processing for Big Data,” Inf. Technol., vol. 58, pp. 186–194, 2016.
[7] B. Gedik et al., “SPADE: The system S declarative stream processing engine,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1123–1134.
[8] M. Hirzel et al., “Stream processing languages in the Big Data era,” ACM SIGMOD Rec., vol. 47, pp. 29–40, 2018.
[9] P. Deutsch et al., “GZIP file format specification version 4.3,” 1996.
[10] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337–343, May 1977.
[11] P. Elias, “Universal codeword sets and representations of the integers,” IEEE Trans. Inf. Theory, vol. IT-21, no. 2, pp. 194–203, Mar. 1975.
[12] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530–536, Sep. 1978.
[13] W. Fang, B. He, and Q. Luo, “Database compression on graphics processors,” Proc. VLDB Endowment, vol. 3, pp. 670–680, 2010.
[14] A. Koliousis et al., “SABER: Window-based hybrid stream processing for heterogeneous architectures,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 555–569.
[15] F. Zhang et al., “FineStream: Fine-grained window-based stream processing on CPU-GPU integrated architectures,” in Proc. USENIX Conf. Usenix Annu. Tech. Conf., 2020, Art. no. 43.
[16] A third of the internet is just a copy of itself, 2013. [Online]. Available: https://www.businessinsider.com/
[17] P. R. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, and M. Seltzer, “Network-aware operator placement for stream-processing systems,” in Proc. 22nd Int. Conf. Data Eng., 2006, pp. 49–49.
[18] F. Zhang, J. Zhai, X. Shen, O. Mutlu, and X. Du, “Enabling efficient random access to hierarchically-compressed data,” in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 1069–1080.
[19] F. Zhang et al., “Efficient document analytics on compressed data: Method, challenges, algorithms, insights,” Proc. VLDB Endowment, vol. 11, pp. 1522–1535, 2018.
[20] R. Agarwal, A. Khandelwal, and I. Stoica, “Succinct: Enabling queries on compressed data,” in Proc. 12th USENIX Conf. Netw. Syst. Des. Implementation, 2015, pp. 337–350.
[21] A. Khandelwal, “Queries on compressed data,” University of California, Berkeley, 2019.
[22] F. Zhang et al., “CompressDB: Enabling efficient compressed data direct processing for various databases,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2022, pp. 1655–1669.
[23] Y. Zhang, F. Zhang, H. Li, S. Zhang, and X. Du, “CompressStreamDB: Fine-grained adaptive stream processing without decompression,” in Proc. IEEE 39th Int. Conf. Data Eng., 2023, pp. 408–422.
[24] P. A. Alsberg, “Space and time savings through large data base compression and dynamic restructuring,” Proc. IEEE, vol. 63, no. 8, pp. 1114–1122, Aug. 1975.
[25] G. Pekhimenko, V. Seshadri, O. Mutlu, M. A. Kozuch, P. B. Gibbons, and T. C. Mowry, “Base-delta-immediate compression: Practical data compression for on-chip caches,” in Proc. 21st Int. Conf. Parallel Architectures Compilation Techn., 2012, pp. 377–388.
[26] D. Abadi, S. Madden, and M. Ferreira, “Integrating compression and execution in column-oriented database systems,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 671–682.
[27] M. A. Roth and S. J. Van Horn, “Database compression,” ACM SIGMOD Rec., vol. 22, pp. 31–39, 1993.
[28] F. Deliège and T. B. Pedersen, “Position list word aligned hybrid: Optimizing space and performance for compressed bitmaps,” in Proc. 13th Int. Conf. Extending Database Technol., 2010, pp. 228–239.
[29] J. Wang et al., “An experimental study of bitmap compression vs. inverted list compression,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2017, pp. 993–1008.
[30] J. Wang et al., “MILC: Inverted list compression in memory,” Proc. VLDB Endowment, vol. 10, pp. 853–864, 2017.
[31] R. Stephens, “A survey of stream processing,” Acta Inform., vol. 34,
[37] S. Zhang et al., “Parallelizing intra-window join on multicores: An experimental study,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2021, pp. 2089–2101.
[38] H. Yar et al., “Towards smart home automation using IoT-enabled edge-computing paradigm,” Sensors, vol. 21, p. 4932, 2021.
[39] Z. Lv et al., “Intelligent edge computing based on machine learning for smart city,” Future Gener. Comput. Syst., vol. 115, pp. 90–99, 2021.
[40] E. Sisinni, A. Saifullah, S. Han, U. Jennehag, and M. Gidlund, “Industrial Internet of Things: Challenges, opportunities, and directions,” IEEE Trans. Ind. Informat., vol. 14, no. 11, pp. 4724–4734, Nov. 2018.
[41] J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, and C. Chen, “Data-driven intelligent transportation systems: A survey,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1624–1639, Dec. 2011.
[42] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surv. Tuts., vol. 19, no. 4, pp. 2322–2358, Fourth Quarter 2017.
[43] J. Chen and X. Ran, “Deep learning with edge computing: A review,” Proc. IEEE, vol. 107, no. 8, pp. 1655–1674, Aug. 2019.
[44] C.-H. Chen, M.-Y. Lin, and C.-C. Liu, “Edge computing gateway of the industrial Internet of Things using multiple collaborative microcontrollers,” IEEE Netw., vol. 32, no. 1, pp. 24–32, Jan./Feb. 2018.
[45] B. Hussain, Q. Du, S. Zhang, A. Imran, and M. A. Imran, “Mobile edge computing-based data-driven deep learning framework for anomaly detection,” IEEE Access, vol. 7, pp. 137656–137667, 2019.
[46] S. Rajesh, V. Paul, V. G. Menon, S. Jacob, and P. Vinod, “Secure brain-to-brain communication with edge computing for assisting post-stroke paralyzed patients,” IEEE Internet Things J., vol. 7, no. 4, pp. 2531–2538, Apr. 2020.
[47] S. Aggarwal and S. Sharma, “Voice based deep learning enabled user interface design for smart home application system,” in Proc. 2nd Int. Conf. Commun. Comput. Ind. 4.0, 2021, pp. 1–6.
[48] S. Liu, L. Liu, J. Tang, B. Yu, Y. Wang, and W. Shi, “Edge computing for autonomous driving: Opportunities and challenges,” Proc. IEEE, vol. 107, no. 8, pp. 1697–1716, Aug. 2019.
[49] A. Zilberman and L. Ice, “Why computer occupations are behind strong STEM employment growth in the 2019–29 decade,” Computer, vol. 4, no. 5, pp. 11–5, 2021.
[50] D. Park et al., “LiReD: A light-weight real-time fault detection system for edge computing using LSTM recurrent neural networks,” Sensors, vol. 18, p. 2110, 2018.
[51] X. Jiang, F. R. Yu, T. Song, and V. C. M. Leung, “A survey on multi-access edge computing applied to video streaming: Some research issues and challenges,” IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 871–903, Second Quarter 2021.
[52] Z. Zhao et al., “IoT edge computing-enabled collaborative tracking system for manufacturing resources in industrial park,” Adv. Eng. Inform., vol. 43, 2020, Art. no. 101044.
[53] P. Ranaweera, A. D. Jurcut, and M. Liyanage, “Survey on multi-access edge computing security and privacy,” IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 1078–1124, Second Quarter 2021.
[54] H. Ziekow and Z. Jerzak, “The DEBS 2014 grand challenge,” in Proc. 8th ACM Int. Conf. Distrib. Event-Based Syst., 2014, pp. 266–269.
[55] Smart home statistics, 2021. [Online]. Available: https://www.statista.com/outlook/dmo/smart-home/united-states
[56] A. Arasu et al., “Linear road: A stream data management benchmark,” in Proc. 30th Int. Conf. Very Large Data Bases, 2004, pp. 480–491.
[57] More Google cluster data, 2011. [Online]. Available: https://ai.googleblog.com/2011/11/more-google-cluster-data.html
[58] V. Gulisano et al., “The DEBS 2017 grand challenge,” in Proc. 11th ACM Int. Conf. Distrib. Event-Based Syst., 2017, pp. 271–273.
[59] V. Gulisano et al., “The DEBS 2018 grand challenge,” in Proc. 12th ACM Int. Conf. Distrib. Event-Based Syst., 2018, pp. 191–194.
[60] C. Mutschler, H. Ziekow, and Z. Jerzak, “The DEBS 2013 grand challenge,” in Proc. 7th ACM Int. Conf. Distrib. Event-Based Syst., 2013,
pp. 491–541, 1997. pp. 289–294.
[32] P. Córdova, “Analysis of real time stream processing systems considering [61] A. Shanbhag, S. Madden, and X. Yu, “A study of the fundamental perfor-
latency,” University of Toronto, 2015. mance characteristics of GPUs and CPUs for database analytics,” in Proc.
[33] M. H. Ali et al., “Microsoft CEP server and online behavioral targeting,” ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 1617–1632.
Proc. VLDB Endowment, vol. 2, pp. 1558–1561, 2009. [62] T. Neumann, “Efficiently compiling efficient query plans for modern
[34] Apache storm, 2021. [Online]. Available: http://storm.apache.org hardware,” Proc. VLDB Endowment, vol. 4, pp. 539–550, 2011.
[35] Apache flink, 2021. [Online]. Available: http://flink.apache.org [63] P. A. Boncz, M. Zukowski, and N. Nes, “MonetDB/X100: Hyper-
[36] D. A. Huffman, “A method for the construction of minimum-redundancy pipelining query execution,” in Proc. Conf. Innov. Data Syst. Res., 2005,
codes,” Proc. IEEE, vol. 40, no. 9, pp. 1098–1101, Sep. 1952. pp. 225–237.
Authorized licensed use limited to: Wenzhou University. Downloaded on December 02,2024 at 05:57:16 UTC from IEEE Xplore. Restrictions apply.
4548 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 9, SEPTEMBER 2024
Yu Zhang received the bachelor's degree from the Department of Computer Science and Technology, Tsinghua University, in 2021. He is currently working toward the PhD degree with DEKE Lab and School of Information, Renmin University of China. His major research interests include database systems and parallel computing.
Feng Zhang received the bachelor's degree from Xidian University, in 2012, and the PhD degree in computer science from Tsinghua University, in 2017. He is a professor with DEKE Lab and School of Information, Renmin University of China. His major research interests include database systems, and parallel and distributed systems.
Hourun Li is a research assistant with the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China, which he joined in 2020. His major research interests include database systems, and parallel and distributed systems.
Shuhao Zhang received the bachelor's degree in computer engineering from Nanyang Technological University, in 2014, and the PhD degree in computer science from the National University of Singapore, in 2019. He is currently an assistant professor with Nanyang Technological University. His research interests include high performance computing, stream processing systems, and database systems.
Xiaoguang Guo is a research assistant with the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China, which he joined in 2020. His major research interests include database systems and distributed systems.

Yuxing Chen received the PhD degree in computer science from the University of Helsinki, Finland, in 2021. He currently works as a senior research engineer with the Database R&D Department, Tencent, China. His research interests focus on database performance and evaluation, HTAP database design, and distributed system design.

Anqun Pan is the technical director of the Database R&D Department, Tencent, China. With more than 15 years of experience, he has specialized in the research and development of distributed computing and storage systems. Currently, he is responsible for steering the research and development of the Tencent distributed database system (TDSQL).

Xiaoyong Du received the BS degree from Hangzhou University, Zhejiang, China, in 1983, the ME degree from the Renmin University of China, Beijing, China, in 1988, and the PhD degree from the Nagoya Institute of Technology, Nagoya, Japan, in 1997. He is currently a professor with the School of Information, Renmin University of China. His current research interests include databases and intelligent information retrieval.