
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 9, SEPTEMBER 2024

Data-Aware Adaptive Compression for Stream Processing

Yu Zhang, Feng Zhang, Hourun Li, Shuhao Zhang, Xiaoguang Guo, Yuxing Chen, Anqun Pan, and Xiaoyong Du

Abstract—Stream processing has been in widespread use, and one of the most common application scenarios is SQL query on streams. By 2021, the global deployment of IoT endpoints reached 12.3 billion, indicating a surge in data generation. However, the escalating demands for high throughput and low latency in stream processing systems have posed significant challenges due to the increasing data volume and evolving user requirements. We present a compression-based stream processing engine, called CompressStreamDB, which enables adaptive fine-grained stream processing directly on compressed streams, to significantly enhance the performance of existing stream processing solutions. CompressStreamDB utilizes nine diverse compression methods tailored for different stream data types and integrates a cost model to automatically select the most efficient compression schemes. CompressStreamDB provides high throughput with low latency in stream SQL processing by identifying and eliminating redundant data among streams. Our evaluation demonstrates that CompressStreamDB improves average performance by 3.84× and reduces average delay by 68.0% compared to the state-of-the-art stream processing solution for uncompressed streams, along with 68.7% space savings. Besides, our edge trials show an average throughput/price ratio of 9.95× and a throughput/power ratio of 7.32× compared to the cloud design.

Index Terms—Data compaction and compression, stream processing, edge computing.

Manuscript received 7 June 2023; revised 26 February 2024; accepted 4 March 2024. Date of publication 19 March 2024; date of current version 7 August 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62322213 and Grant 62172419, and in part by Beijing Nova Program under Grant 20220484137 and Grant 20230484397. Recommended for acceptance by S. Salihoglu. (Corresponding author: Feng Zhang.)

Yu Zhang, Feng Zhang, Hourun Li, Xiaoguang Guo, and Xiaoyong Du are with the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), School of Information, Renmin University of China, Beijing 100872, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Shuhao Zhang is with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).

Yuxing Chen and Anqun Pan are with the Database R&D Department, Tencent Inc., Shenzhen 518000, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TKDE.2024.3377710

I. INTRODUCTION

The contemporary era of Big Data witnesses extensive use of stream processing technologies [1], [2], [3], [4]. In 2021, active endpoints reached 12.3 billion, reflecting a global 9% increase in connected IoT devices [5]. Notably, low latency and real-time are two of the most prominent features of stream processing, enabling the analysis and querying of vast, continuously incoming data streams, including sensor data [6] and financial transactions [7]. Nevertheless, stream processing systems grapple with the challenge of escalating data volumes as the scale of data streams continues to grow [8]. On one hand, the network bears significant strain from the sheer volume of stream data, hindering real-time functionality in the presence of noticeable transmission delays. On the other hand, a high rate of data arrival can overload server memory, as stream processing systems temporarily store data in memory. Thus, it becomes imperative to explore innovative approaches aimed at alleviating the memory and bandwidth pressures confronting stream processing systems. Data compression, a conventional technique for minimizing file sizes [9], [10], [11], [12], [13], can enhance the efficiency of stream systems and contribute to a reduction in storage requirements when applied in stream processing scenarios.

The utilization of compression in stream processing is pivotal as it enhances the efficiency of stream systems, offering potentially three key advantages. First, stream processing often involves a substantial volume of continuous input data with comparable features, such as timestamps [14], [15], transaction amounts [7], and sensor values [6]. Notably, up to 30% of the data may be duplicated [16]. Through data compression, the redundancy in data can be effectively minimized due to the similarity of input streams, thereby reducing the volume of stream data. Second, in stream processing scenarios, the overhead from memory access and network transfer between nodes surpasses that of computation [17]. Our experiments reveal that transmission can consume up to 70% of the time with a 500 Mbps network. Consequently, it is evident that data compression significantly enhances the efficiency of stream systems. Third, the proven utility of direct computing on compressed data extends to data science applications [18], [19], [20], [21], [22], demonstrating its widespread performance benefits.

However, constructing compressed stream direct processing systems faces three major challenges. First, low latency is crucial for stream processing systems, but the encoding time required by compression methods often introduces significant delays. Experiments in Section II-B reveal that using Gzip may account for up to 90.5% of the overall stream processing time for encoding, an unacceptable overhead. Second, the processing queries and input data in stream processing scenarios are dynamic and subject to modification based on user needs. Some compression algorithms exhibit lower time overheads for compression and decompression, while others offer higher compression ratios.

1041-4347 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Wenzhou University. Downloaded on December 02,2024 at 05:57:16 UTC from IEEE Xplore. Restrictions apply.
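The trade-off just described, where heavyweight codecs buy compression ratio at the cost of time while lightweight schemes favor speed and whose effectiveness depends on the data's shape, can be made concrete with a small sketch. This is illustrative only, not CompressStreamDB's implementation: the synthetic columns, helper names, and the 8-bytes-per-run size estimate are assumptions invented for the example, with zlib standing in for Gzip's DEFLATE and run-length encoding representing the lightweight class.

```python
import zlib
from array import array

def rle_encode(values):
    """Run-length encoding: a list of (value, run_length) pairs.
    A typical lightweight scheme; it shines on data with long runs."""
    runs = []
    prev, count = values[0], 1
    for v in values[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def rle_decode(runs):
    """Lossless inverse of rle_encode."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

def rle_size_bytes(runs):
    # Assumed layout: 4-byte value + 4-byte count per run.
    return len(runs) * 8

# Two synthetic columns with different characteristics.
repetitive = [220] * 50_000 + [221] * 50_000   # e.g. a house-ID column
varied = list(range(100_000))                  # e.g. a timestamp column

for name, col in [("repetitive", repetitive), ("varied", varied)]:
    raw = array("i", col).tobytes()            # uncompressed baseline
    runs = rle_encode(col)
    assert rle_decode(runs) == col             # lossless round trip
    print(name, len(raw), len(zlib.compress(raw)), rle_size_bytes(runs))
```

On the run-heavy column, RLE collapses 100,000 values into two (value, count) pairs, while on the strictly increasing column it inflates the data beyond its raw size; no single algorithm wins on all inputs, which is exactly why an adaptive, data-aware selector is needed.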

To achieve optimal performance, compression must be adaptive to different input workloads, necessitating a careful consideration of the advantages and disadvantages of each algorithm. Third, decompressing data before executing SQL queries can be time-consuming. In our experiments, the time overhead of decompression compared to query execution ranges from 2.09× to 31.37×. This introduces a potential performance impact due to the additional time and space requirements for decompression.

We introduce CompressStreamDB, a compression-based stream processing engine designed to overcome these three challenges. First, addressing the need for lightweight and fast compression algorithms to meet low-latency and real-time requirements in stream processing, CompressStreamDB integrates nine lightweight compression methods to enhance efficiency across various input streams. Moreover, deploying the stream processing system on edge devices brings the processor closer to data sources, facilitating accelerated data processing. Second, we introduce a fine-grained adaptive compression algorithm selector capable of dynamically choosing the compression algorithm that provides optimal performance benefits for input streams with varying features. Our system incorporates a cost model that guides the selector's decisions as the workload shifts, considering properties such as the value range and degree of repetition in the input data. This model estimates the time consumed by each compression algorithm, enabling the selector to choose the most efficient one. Third, we propose a method enabling direct querying of compressed data if the data are aligned in memory, thereby avoiding decompression costs. This approach applies query operations to compressed data with minimal modification, provided the data maintain their structure after compression. Additionally, we view lightweight decompression-required techniques as a specific case, integrable into CompressStreamDB. Our preliminary work has been presented in [23]. In this paper, we add a new platform, a new dataset, a new compression algorithm, and new evaluations. Specifically, the new idea of applying edge devices is valuable compared to the cloud. Edge devices show potential in stream processing because they have lower costs and can be deployed close to data sources. We analyze the cost and power benefits of edge devices in detail.

We conducted experiments in both cloud and edge environments, employing four widely-used datasets with varying properties. The cloud platform utilized an Intel Xeon Platinum 8269CY 2.5 GHz CPU, while the edge platform employed a Raspberry Pi 4B. Our experimental results demonstrate that CompressStreamDB outperforms the state-of-the-art stream processing approach, achieving maximum system efficiency. CompressStreamDB exhibits a throughput increase of 3.84× and an average latency reduction of 68.0%. In terms of space savings, CompressStreamDB reduces data storage needs by 68.7%. Furthermore, the edge platform exhibits a throughput/price ratio that is 9.95× higher than the cloud platform, while its throughput/power ratio is 7.32× higher than that of the cloud platform.

Overall, we make the following three major contributions.
- We develop a compressed stream processing engine featuring diverse lightweight compression methods applicable across various scenarios.
- We introduce a system cost model to guide compressed stream processing and design an adaptive compression algorithm selector based on this model.
- We devise a processing approach that directly executes SQL queries on compressed streams, and conduct comprehensive experiments to validate its effectiveness.

II. BACKGROUND

A. Stream Processing and Streaming SQL

Stream processing: Stream processing, a term in data science, focuses on the real-time processing of continuous streams of data, events, and messages. It encompasses various systems, including reactive systems, dataflow systems, and specific classes of real-time systems [31]. The query is the SQL statement used for data processing, which can be further subdivided into different operators. In the stream processing context, a stream comprises a sequence of tuples, where each tuple represents an event with elements like timestamp, amounts, and values. Tuples collectively form batches, which represent processing blocks containing a specific number of tuples. Within a batch, we use the term column to denote elements of different tuples in the same field. Stream processing finds extensive applications in scenarios requiring minimal latency, real-time response with minimal overhead (e.g., risk management [32] and credit fraud detection [14]), and predictable and approximate results (e.g., SQL queries on data streams [15] and click stream analytics [33]).

Streaming SQL: Among the various fields of stream processing, streaming SQL is one of the emerging hot research topics. Streaming SQL can be perceived as the streaming version of SQL, processing streams of data instead of a database. Traditional SQL queries process the complete set of available data in the database and generate definite results. In contrast, streaming SQL needs to continuously process the arriving data, and the result is non-deterministic and constantly changing. As a result, this raises a number of issues, such as how to reduce the response time. Streaming SQL has a declarative nature similar to SQL, and provides an effective stream processing technology, which largely saves time and elevates productivity in stream data analysis. Besides, many stream systems have been proposed, such as Apache Storm [34] and Apache Flink [35], whose relational APIs are suitable for stream analysis, providing a solid development foundation and productive tools.

B. Compression Algorithms

Various compression algorithms have been proposed, but to ensure accurate query results, our system exclusively considers lossless compression algorithms. Lossless compression algorithms can be categorized into heavyweight and lightweight compression. Noteworthy heavyweight compression algorithms, such as Lempel-Ziv algorithms [10], [12] and Huffman encoding [9], [36], offer high compression ratios but involve complex encoding and decoding processes, causing significant time overhead. Given the real-time and low-latency


TABLE I
EAGER AND LAZY COMPRESSION METHODS IN LIGHTWEIGHT COMPRESSION

requirements of stream processing, which cannot tolerate prolonged delays, our exploration of heavyweight compression algorithms revealed that while they may achieve higher compression ratios, they also result in longer (de)compression times, offering limited improvement in the performance and stability of the stream system. In preliminary experiments, we utilized Gzip, a compression tool commonly used in stream processing systems. However, the system with Gzip spent 90.5% of the total time in compression and less than 10% in transmission. Despite its high compression ratio and low transmission time, the compression time overhead could lead to system delays or even pauses. Hence, we advocate the use of lightweight compression algorithms to expedite stream processing.

Lightweight compression: Lightweight compression represents a trade-off between compression ratio and time, employing relatively simple encoding methods. In contrast to heavyweight compression algorithms, lightweight alternatives sacrifice some compression ratio for faster (de)compression times. We examined a range of works [11], [13], [24], [25], [26], [27], [28], [29], [30] on lightweight compression algorithms, covering the most commonly used ones. Each algorithm has unique advantages and disadvantages that make it appropriate for data streams with different characteristics. For instance, Elias Gamma encoding and Elias Delta encoding [11] are suitable for small and large numbers, respectively. Run Length Encoding [26] is effective for data with more repetition. The effectiveness of Null Suppression [13] depends on redundant leading zeros in the elements. Bitmap and its extensions [28], [29], [30] are suitable for compressing data with few distinct values.

Eager and lazy compression: We categorize these lightweight compression algorithms into two groups: eager compression and lazy compression [37]. In Table I, we provide a summary of nine common lightweight compression algorithms under the two categories. Eager compression algorithms compress subsets of input tuples as soon as they arrive, allowing them to process each tuple without waiting. On the other hand, lazy compression algorithms wait for the entire data batch before compression. The advantage of eager algorithms lies in their ability to process each tuple in real time, while the advantage of lazy algorithms is their capacity to leverage the similar redundancy in large datasets, achieving a higher compression ratio.

C. Edge Computing

Compared to the traditional cloud computing architecture, edge computing pushes computing resources and services to the Internet of Things, end devices, and user terminals to achieve real-time data processing and response, reducing the pressure on bandwidth, storage, and computing resources caused by centralized computing. It meets the requirements of low latency, high bandwidth, and data security. With the rapid development of the Internet of Things, cloud computing, Big Data, and other relevant technologies, edge computing has been widely used in fields such as smart homes [38], smart cities [39], the industrial Internet [40], and intelligent transportation [41].

The lightweight development of edge devices is the current and future trend. With the surge of the mobile Internet, edge computing has extended beyond personal computers and servers to encompass mobile devices, including edge computing platforms based on mobile phones [42]. Following the rapid progress of Internet of Things (IoT) technology, edge computing has expanded into the realm of low-power and embedded devices, such as the Raspberry Pi 4B [43] and microcontrollers [44]. These devices, characterized by smaller size, lower power consumption, and versatility to run in various environments, support multiple communication protocols and data processing algorithms. They can perform tasks like anomaly detection [45], exoskeletons [46], voice activation [47], object detection [48], and more. The edge device market is anticipated to grow, driven by the increasing demand for real-time data processing and analysis, as well as the need for low-latency, high-bandwidth, and secure data transmission. According to IDC's forecast, the global number of connected devices is expected to surpass 8 billion by 2025, with approximately 40% of these devices situated at the edge [49].

In stream data processing, edge computing can handle data at the point of generation, alleviating the burden of data transmission and storage. This enables faster real-time analysis of data. For instance, it finds applications in real-time detection systems based on sensors [50], video stream analysis [51], logistics tracking systems [52], network security detection [53], and data processing for autonomous vehicles [48]. Edge devices can efficiently compress and process stream data, meeting the real-time and accuracy requirements of practical tasks.

III. MOTIVATION

A. Problem Definition and Basic Idea

Problem definition: We show the problem definition of processing compressed streams as follows. The input data streams are unbounded sequences of tuples, which are generated from the data source. The data block to be processed in the stream


is referred to as a window w, containing a sequence of tuples of a preset size. We use SQL queries to handle these streams. Each query contains different operators, including projection, aggregation, groupby, etc. Given a chosen compression algorithm τ, the stream is compressed at the source, and the compressed stream is denoted as R. Finally, compressed streams and queries are transmitted to the processor. The result of compressed stream processing consists of the tuples in the stream after a series of queries. Our optimization aims at minimizing latency while increasing throughput.

Basic idea of compressed stream direct processing: To solve the problem, our basic idea is compressed stream direct processing. In detail, we develop a fine-grained adaptive model to select appropriate compression schemes and perform mapping between compressed data and operators. For each streaming SQL operator, we modify the number of bytes it reads and compress the values it uses. In this way, data can be queried without decompression, thus saving both time and space. Note that efficient lightweight decompression-required methods, which can bring significant benefits, should also be considered. In our scenario, we treat them as a special case.

B. Dynamic Characteristics in Stream Processing

Special dynamism in stream processing: The dynamic nature of stream data manifests in three key aspects. First, the attributes of stream data, such as value ranges and repetition degrees, can change unpredictably, impacting the achievable compression ratios of different algorithms. Second, factors like data generation speed and network delays cause fluctuations in the arrival rate of stream data, influencing system waiting times. Third, it is impossible to predict how frequently data properties can change, necessitating a balance between efficiency and overhead when determining the re-decision frequency for dynamic processing.

Differences from column compression in databases: The disparities between stream processing and traditional databases call for innovative compression strategies. Traditional databases perform holistic operations after fully scanning the data, enabling the selection of compression algorithms based on overall data attributes. In addition, compression in databases focuses more on compression ratios rather than low-latency real-time processing. Conversely, stream processing involves real-time, unbounded data streams that constantly evolve, necessitating dynamic updates to compression methods to adapt to these changes.

C. Case Study

Fig. 1 shows a motivating example to illustrate the comparison between static processing in traditional databases and dynamic processing in a stream scenario. It uses a case study from smart grids [54]. The smart home market is projected to reach a volume of 51.23 billion by 2026, with an estimated 84.9 million active homes and an annual growth rate of 11.7% [55]. This dataset comprises over 4,055 million energy consumption measurements collected from smart plugs installed in private households. It encompasses data from 2,125 plugs distributed across 40 houses over one month. Seven attributes are contained in the dataset, including timestamps, the measurement value, the ids of the plug and house, etc.

Fig. 1. Example of smart grids.

Dynamic characteristics: The characteristics of real-time electricity consumption data in the stream are consistently dynamic. Fluctuations in consumption peaks and troughs, household habits, and usage patterns cause constant shifts in the data stream. For instance, when a household generates substantial power consumption data within a short span, this data may manifest as repeated house IDs with changing plug IDs and values within the stream. Similarly, during peak hours when multiple households are consuming electricity simultaneously, house IDs might frequently change while timestamps remain constant.

Opportunity: In traditional database processing, analysis of the content occurs beforehand, enabling pre-determination of the processing method. Conversely, in a stream scenario, data arrive continuously, necessitating immediate processing as events appear. Stream processing methods lack access to complete information in advance due to the dynamic nature of input data streams. Consequently, real-time processing methods need to adapt as the input stream evolves. As demonstrated in Section VII, our solution, CompressStreamDB, significantly outperforms static processing methods, highlighting its superior performance in dynamic environments.

D. Widespread Use of Compressed Stream Direct Processing

Our compressed stream direct processing solution offers extensive applicability across numerous stream applications. We present illustrative examples showcasing its versatility in various scenarios.
- IoT sensor data from the smart grid domain [54] is an underlying scenario, which involves the analysis of energy consumption measurements. This aims to offer short-term load predictions and real-time demand management. However, dynamic workload fluctuations present real-time challenges. Leveraging compressed stream direct processing can significantly enhance throughput, enabling efficient processing of large data volumes within short time frames.
- Real-time decision in the Linear Road benchmark [56], which specifies an expressway variable tolling system. Each vehicle on the

highway transmits its location through sensors, which are utilized to calculate tolls based on the specific road section. Lower tolls incentivize the use of less congested roads. Our solution enables the system to efficiently process substantial volumes of streaming vehicle location data, facilitating more effective decision-making in toll adjustments.
- Cluster management [57] can monitor the execution of computation tasks. The incoming data relate to the status of the cluster, including task submission, state of success or failure, etc. Anomaly alerts for unexpected failures should be emitted as soon as possible. Our solution can provide a more rapid response for anomaly detection.

Various other real-time stream applications, including manufacturing equipment detection [58], ship behavior prediction [59], and temporal event sequence detection [60], necessitate efficient stream processing. Fig. 2 illustrates the breakdown of time utilization in these applications. The complete bar denotes the overall duration of uncompressed stream processing, with the white segment representing the portion of time consumed by network transmission. Notably, with a 500 Mbps bandwidth network, network transmission occupies over 70% of the total time. Even on a 1 Gbps network, transmission still accounts for about 50% of the total time. This highlights the bottleneck created by transmission time in stream applications, underscoring the critical need for the advantages offered by compressed stream direct processing.

Fig. 2. Total time breakdown.

IV. COMPRESSSTREAMDB FRAMEWORK

We propose a fine-grained compressed stream processing framework, called CompressStreamDB, and we show our system design in this section.

A. Overview

CompressStreamDB addresses the challenges mentioned in Section I, effectively mitigating time and space overhead in stream processing. It dynamically selects compression algorithms and seamlessly integrates them into stream processing.

Structure: The CompressStreamDB framework comprises two core components, the client and the server, depicted in Fig. 3. The client has a compression algorithm selector based on the cost model. This selector is tasked with data collection and optimal compression algorithm selection. Note that the term "client" refers to devices seeking compressed stream processing in collaboration with the server, encompassing data sources like sensors or smartphones, or intermediate nodes handling data. Leveraging lightweight compression algorithms, even resource-constrained devices like data sources can perform compression. Consequently, our system accommodates a multi-layer architecture with multiple compression client layers, while the client-server setup represents a simplified model. Compression functionalities are deployed on the client side. In a distributed architecture, individual clients perform independent compression without coordination. In scenarios where a single query's input data originates from multiple clients, each client autonomously determines its compression strategy based on the specific data characteristics. The server manages the processing of queries on compressed stream data, housing the kernel functions necessary for executing these queries. It is important to note that while CompressStreamDB is primarily designed for direct processing of compressed stream data, it does not dismiss the inclusion of efficient compression algorithms that require decompression. They can also be integrated into the system and should not be ignored.

Fig. 3. CompressStreamDB framework.

Scenario: In a streaming scenario, CompressStreamDB dynamically selects compression algorithms and conducts fine-grained compression-based stream processing based on specified parameters like network throughput and the performance metrics of clients and servers. The compression algorithm selection aims to optimize the system's overall performance, specifically to minimize the total processing time.

Workflow: After the data are generated in the client of CompressStreamDB, the data mainly undergo a series of processes including compression, transmission, decompression, and query, which is also the basis of our proposed system cost model. Prior to compression, the selector preloads the data and identifies the compression algorithm that ensures optimal performance. This decision-making process relies on our comprehensive cost model, considering various factors from machine metrics and network conditions to the effectiveness and cost of compression algorithms (refer to Section IV-C for details). Our system operates at batch granularity, employing distinct compression algorithms for each data column, as discussed in Section II. Subsequently, the compressed data are transmitted to the server, where they are processed alongside the corresponding SQL queries.

Batch: In CompressStreamDB, stream data are processed at batch granularity. The batch size operates independently from


the window size in streaming SQL. The window size pertains to a range concept within SQL, whereas the batch size represents the processing granularity of the query engine [1], [14], [15]. It is worth noting that a batch can be smaller than a window or encompass multiple windows. The batch size setting plays a dual role, since growing the batch size can increase both the latency and the compression ratio. We determine the batch size using dynamic sampling, whose overhead can be amortized during stream processing. Users can specify and adjust the batch size based on actual requirements. Experimental insights can be found in Section VII.

Flexibility: CompressStreamDB stands as a highly flexible system, supporting not only nine existing data compression methods but also the seamless integration of additional compression algorithms. This flexibility is designed to effectively address the increasing demand for stream data processing. It empowers CompressStreamDB to handle and analyze diverse data streams, varying in types, scales, and rates, allowing the system to better align with the demands of real-world tasks.

Portability: The client of CompressStreamDB is highly portable, readily adaptable to diverse devices, including embedded edge devices like the Raspberry Pi 4B, with minimal modifications. This versatility stems from the lightweight and high-speed algorithms implemented in the client, which demand minimal computational power. The server of CompressStreamDB is also portable and can be deployed to diverse high-performance devices. Its direct SQL operators are universally designed and can be adapted to different compression algorithms and platforms. This portability makes CompressStreamDB well-suited for resource-constrained environments, notably edge computing scenarios.

B. Compressed Stream Processing

Adaptive processing for dynamic workload: In CompressStreamDB, we dynamically process the input data stream using our selector. As detailed in Section II, stream data processing is achieved through SQL queries, treating a batch as the minimum processing unit. The system predominantly employs common relational operators, including projection, selection, aggregation, group-by, and join. Stream processing is performed through query statements composed of these operators with a given sliding-window size. After a preset number of batches, the system dynamically reselects compression algorithms for data columns using the system cost model. CompressStreamDB then scans the next five batches to predict the data properties of the follow-up stream, uses the system cost model to estimate latency from those properties, and finally identifies the new processing method with the lowest total processing time. Because the compression algorithms we use are all lightweight, the overhead of dynamic reselection is negligible. The batch size and window size in our system are independent of each other, and changes in compression do not directly affect the windows. For smaller windows within a batch, reselecting compression affects multiple windows; for larger windows that span multiple batches, reselection impacts only subsequent batches, without requiring recompression of previously compressed batches.

Supported data types: CompressStreamDB not only incorporates lightweight compression algorithms for integers, but also supports operations on floating-point numbers and strings. Floating-point items can be converted into integers by multiplying by a factor of 10^n [13], where n denotes the maximum number of decimal places within the data column. For instance, measured values in smart grids [54] include numbers such as {3.216, 11.721, 9.8}. With a maximum of three decimal places, all data can be scaled by a factor of 10^3, resulting in {3216, 11721, 9800}. Given that data columns typically exhibit closely aligned decimal places, overflow is uncommon in most scenarios. Overflow risks arise only when the converted integer exceeds 2^31 (the limit of a 32-bit integer); in such cases, we recommend either a 64-bit integer representation or the dictionary encoding method. Strings can be mapped to integers using dictionary encoding, a widely used method with marginal overhead [61], [62], [63]. Our evaluation in Section VII covers integer, floating-point, and string data types, all of which are encoded as integers before loading. After this unified encoding, different types of data can be processed in CompressStreamDB.

Query without decompression: Decompression is employed to restore the original data. CompressStreamDB avoids decompression as much as possible, thus reducing time and memory accesses and accelerating the query process. In our design, we can directly query the compressed data when the compressed stream meets the following three conditions. First, the compressed data are similar to the data before compression and are still structured. Second, the compressed stream data should be aligned. Third, the compression does not affect the order of the stream or the process of kernel operation. Our SQL operators are specially designed for compressed data processing: they accept as parameters the number of bytes each compressed column occupies. For instance, if the original column holds 4 bytes per element but is compressed to 1 B per element, our operators handle this column by reading and writing only 1 B for each entry. Although various compression algorithms encode raw data differently, their results ultimately conform to a fixed format; as long as the compressed format meets these three conditions, direct processing of the compressed data is supported. This universal design avoids the complexity of developing separate operator kernels for different compression methods. Our implementation is portable to diverse devices because it does not require any special hardware support.

Example: Assume that the stream data include three columns: col1 is 8 bytes, col2 is 4 bytes, and col3 is 4 bytes. After compression, col1' is 2 bytes, col2' is 1 B, and col3' is 1 B. A query like "select col1, avg(col2) from data group by col3" can be mapped to "select col1', avg(col2') from data group by col3'". In this way, we only need to update the number of bytes to be read for each corresponding column in the operator.
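The byte-width mapping in this example can be sketched as follows. This is an illustrative sketch, not the system's actual operator interface: the column values, the little-endian layout, and the 1 B compressed width are assumptions for demonstration.

```python
import struct

# Hypothetical batch: col2 originally holds 4-byte integers, but the
# selected compression packs each element into 1 B.
col2_raw = [17, 23, 5, 42]
col2_compressed = struct.pack("4B", *col2_raw)   # 1 B per entry

def avg_column(buf: bytes, width: int) -> float:
    """An aggregation operator that only needs to know how many bytes
    each compressed entry occupies: it reads `width` bytes per element
    instead of the original 4, with no decompression step."""
    n = len(buf) // width
    total = sum(int.from_bytes(buf[i * width:(i + 1) * width], "little")
                for i in range(n))
    return total / n

# avg(col2') computed directly on the compressed bytes.
assert avg_column(col2_compressed, 1) == sum(col2_raw) / len(col2_raw)
```

The same operator body serves any compression scheme that keeps the column byte-aligned and structured; only the `width` parameter changes.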

The original stream processing operators are mapped to the corresponding compressed stream processing operators based on the compression format. With such designs, we can compare and calculate the compressed values directly, allowing us to perform queries directly on compressed data. Therefore, the result of compression can be applied to the entire query execution, including intermediate results, enhancing system efficiency. It is important to note that CompressStreamDB accommodates diverse compression schemes: algorithms requiring lightweight decompression should be considered if their performance benefits outweigh the decompression overhead.

TABLE II
SYMBOLS AND MEANINGS

C. System Cost Model

To guide the system to automatically select a suitable compression algorithm at runtime, we propose a cost model for stream processing systems with compression. Previous works [13], [64] provide cost models only for the compression algorithms, not for stream processing. As far as we know, our work is the first to provide a cost model for compressed stream processing. The difficulty of proposing such a model lies in the complexity of the processing procedures and scenarios. In our processing scenario, we take machine metrics, network conditions, and other extensive factors into consideration, and address these difficulties through a multi-step cost model.

The process of CompressStreamDB consists of four primary stages: compression, transmission, decompression, and query processing. To model the costs across these stages, we develop a system cost model. Table II outlines the key parameters used in the model.

We represent the times of compression, data transmission, decompression, and query processing by t_compress, t_trans, t_decom, and t_query, respectively, and the whole processing time as t. Based on the above analysis, the system cost of compressed stream processing is given in (1).

t = t_compress + t_trans + t_decom + t_query.    (1)

In the rest of this section, we model the time consumption of these four stages, taking machine metrics, network conditions, and the efficiency of the compression techniques into consideration.

1) Compression time: For the processing batch, Size_T represents the number of bytes per tuple, while Size_B is the number of tuples per batch; thus there are Size_T · Size_B bytes for each batch. For a chosen compression algorithm τ, T_memory^{com,τ} denotes the number of instructions used for memory accesses during compression, while T_operation^{com,τ} represents the number of instructions used for computation. Then t_compress can be defined by (2).

t_compress = α · t_wait + max( (T_memory^{com,τ} + T_operation^{com,τ}) / N_client , (Size_T · Size_B) / B_client ).    (2)

N_client is the CPU FLOPS of the client, and B_client is the memory bandwidth of the client; both can be obtained from the hardware specifications. The client is responsible for compression. If the compression program is memory-intensive on the client, (Size_T · Size_B) / B_client is the larger term; otherwise, if it is compute-intensive, (T_memory^{com,τ} + T_operation^{com,τ}) / N_client is larger.

As mentioned in Section II-B, eager compression algorithms compress data immediately, while lazy compression algorithms wait until the whole data batch arrives. Hence, if t_wait represents the time spent waiting for a data batch, α · t_wait is the time that τ spends waiting, where α is defined in (3).

α = 1 if the compression algorithm τ is lazy; α = 0 if it is eager.    (3)

2) Transmission time: We denote the time allocated to transmission as t_trans, and r represents the compression ratio achieved by the selected algorithm, as outlined in Section V. The tuple size after compression is then Size_T / r. With Size_B tuples in a batch, transmitting the compressed batch requires (Size_T · Size_B) / r bytes. If the network bandwidth suffices and queuing delay remains negligible, t_trans can be expressed as shown in (4).

t_trans = ((Size_T · Size_B) / r) · latency.    (4)

When the network bandwidth is fully occupied, t_trans is given by (5).

t_trans = (Size_T · Size_B) / (r · bandwidth).    (5)
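The compression and transmission terms in (2)-(5) can be sketched numerically as follows. All figures here (instruction counts, FLOPS, bandwidth, tuple and batch sizes) are made-up placeholders, not measured values from the system.

```python
def t_compress(alpha, t_wait, t_mem, t_op, size_t, size_b, n_client, b_client):
    # Eq. (2): optional waiting term for lazy algorithms, plus the larger
    # of the compute-bound and memory-bound estimates on the client.
    return alpha * t_wait + max((t_mem + t_op) / n_client,
                                size_t * size_b / b_client)

def t_trans_saturated(size_t, size_b, r, bandwidth):
    # Eq. (5): the compressed batch occupies SizeT * SizeB / r bytes.
    return size_t * size_b / (r * bandwidth)

# Placeholders: 4-byte tuples, 1024-tuple batch, ratio 4, 100 MB/s link,
# an eager algorithm (alpha = 0, so no waiting term).
cost = (t_compress(0, 0.0, 1e3, 2e3, 4, 1024, 1e9, 1e10)
        + t_trans_saturated(4, 1024, 4.0, 1e8))
```

Under these placeholder numbers the compute term dominates compression (3 μs versus about 0.4 μs for memory), which is the comparison the max(·) in (2) captures.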

3) Decompression time: Considering the importance of future-proof compression algorithms that require decompression pre-processing, our system retains the flexibility to incorporate such methods. For a chosen compression algorithm τ, T_memory^{decom,τ} symbolizes the number of instructions executed for memory accesses during decompression, while T_operation^{decom,τ} denotes the count of computational instructions. This allows us to define t_decom as per (6).

t_decom = β · max( (T_memory^{decom,τ} + T_operation^{decom,τ}) / N_server , (Size_T · Size_B) / B_server ).    (6)

N_server is the CPU FLOPS of the server, analogous to N_client, and B_server is the memory bandwidth of the server. If the decompression program leans towards memory-intensive operations on the server, (Size_T · Size_B) / B_server is the larger term; alternatively, if it is more compute-intensive, (T_memory^{decom,τ} + T_operation^{decom,τ}) / N_server is larger. β indicates whether τ needs decompression, and is defined in (7).

β = 1 if the compression algorithm τ needs decompression; β = 0 otherwise.    (7)

4) Query time: The query is executed on the server through kernel functions. The compression process mainly affects the efficiency of kernel functions in memory reads and writes, but does not affect the computation process. Because CompressStreamDB reads and writes in bytes, the memory read and write time is proportional to the number of bytes occupied in memory. We use t_operation^{query} to represent the computation time of a query, and t_memory^{query} to represent the query time spent on memory reads and writes; both represent the processing time in the uncompressed condition. We can obtain t_query by (8).

t_query = t_operation^{query} + t_memory^{query} / r'.    (8)

Please note that r' signifies the compression ratio during query execution, distinct from the compression ratio denoted as r. The determination of r' hinges on whether the server undertakes decompression as part of the query process; its definition is given in (9).

r' = 1 if the compression algorithm τ needs decompression; r' = r otherwise.    (9)

V. SELECTED COMPRESSION ALGORITHMS

CompressStreamDB incorporates nine lightweight compression algorithms, detailed in Section II. The system dynamically chooses among these options at runtime using our system cost model, and several parameters of the cost model relate to the compression algorithms.

A. Preliminaries

We present the system cost model parameters associated with each compression algorithm, delineating their respective advantages and drawbacks, including parameters such as α, β, r, and r'. The parameters T_memory^{com,τ}, T_operation^{com,τ}, T_memory^{decom,τ}, and T_operation^{decom,τ}, which count instructions, can be derived directly by inspecting the assembly source code. This approach allows a tailored cost model to be constructed for each compression algorithm.

In the rest of this section, we consistently employ the symbol Size_B to signify the processing batch size. Furthermore, we introduce a new symbol, Size_C, representing the size of each element in the column. The discussion of each compression algorithm introduces a set of parameters related to dataset properties; these parameters are described explicitly in the context of each algorithm.

B. Eager Compression

Eager compression algorithms compress the arrived elements immediately and do not need to wait for the whole batch. Hence, their α value in the system cost model is 0.

Elias Gamma encoding (EG): The idea of Elias Gamma encoding is to use the number of leading zeros to represent the number of valid binary bits of the data [11]. Given a positive number x, let L = ⌊log2 x⌋. The Elias Gamma code of x is L zero bits followed by the binary form of x; hence it uses 2L + 1 bits to represent x.

Elias Gamma encoding, being a variable-length encoding, lacks a fixed number of bits. However, as discussed in Section IV-B, we align its encoding to ensure a consistent byte representation. For a specific data column, EGDomain signifies the maximum number of bytes required by Elias Gamma encoding in that column; consequently, data within this column occupy EGDomain bytes during processing. This encoding method can struggle with outliers and favors uniform input data: significant outliers in the input can notably escalate redundancy across the entire column, impacting efficiency.

Elias Gamma encoding processes only positive integers, while our data comprise non-negative integers. To accommodate this, we increment each integer by 1 before compression and decrement it by 1 during decompression; this simple adjustment makes non-negative integers compatible with the encoding scheme. The compression ratio r of Elias Gamma encoding is described in (10).

r = Size_C / EGDomain.    (10)

The storage format of Elias Gamma encoding in CompressStreamDB ensures alignment to a consistent byte length while maintaining the structured nature of the compressed data. Utilizing this format, CompressStreamDB bypasses decompression. The parameters associated with Elias Gamma encoding are β = 0, r = r' = Size_C / EGDomain.
• Pros: 1) It is suitable for scenarios where small integers occur frequently. 2) It can avoid the overhead of decompression.
• Cons: 1) Elias Gamma encoding is relatively slow in contrast to other lightweight algorithms, due to the logarithmic operation required before adding the leading zeros. 2) It is not good at handling large outliers.
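The Elias Gamma scheme above can be sketched as follows. The +1 shift for non-negative inputs follows the text; the bit strings here stand in for the byte-aligned storage format, which this sketch does not model.

```python
def elias_gamma(x: int) -> str:
    """Elias Gamma code of a positive integer: L leading zeros followed
    by the (L + 1)-bit binary form of x, where L = floor(log2 x), for a
    total of 2L + 1 bits."""
    assert x >= 1
    L = x.bit_length() - 1          # floor(log2 x)
    return "0" * L + bin(x)[2:]

def encode_nonnegative(x: int) -> str:
    # Non-negative inputs are shifted by +1 before encoding.
    return elias_gamma(x + 1)

assert elias_gamma(1) == "1"            # L = 0: a single bit
assert elias_gamma(9) == "0001001"      # L = 3: 2*3 + 1 = 7 bits
assert encode_nonnegative(0) == "1"
```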
Elias Delta encoding (ED): Elias Delta encoding, a variant of Elias Gamma encoding, extends the process by performing an additional Elias Gamma encoding on the value derived from the first encoding [11]. Given a positive number x, let L = ⌊log2 x⌋ and N = ⌊log2(L + 1)⌋. The Elias Delta code of x consists of 1) N zero bits, followed by 2) the (N + 1)-bit binary representation of L + 1, and 3) the last L bits of the binary form of x. Hence, it uses 2N + L + 1 bits to represent the number x.

As with Elias Gamma encoding, we extend Elias Delta encoding to handle non-negative integers, and we align its encoding results. For a given data column, we use EDDomain to represent the maximum number of bytes required by Elias Delta encoding for the elements of this column; the data in this column then occupy EDDomain bytes during processing. Elias Delta encoding requires more bits than Elias Gamma encoding for small values, but performs better with larger values; for larger integers, its code length approaches the entropy, making it nearly optimal. The compression ratio r for Elias Delta encoding is expressed in (11).

r = Size_C / EDDomain.    (11)

Its parameters are β = 0, r = r' = Size_C / EDDomain.
• Pros: 1) It is more stable than Elias Gamma encoding. 2) It can handle a larger range of values.
• Cons: 1) Its compression process is more complicated, so compression can be slower. 2) For very small values, its performance is not as good as Elias Gamma encoding.

Null suppression with fixed length (NS): The null suppression with fixed length method removes leading zeros from the binary representation of each element, efficiently eliminating the redundancy caused by the data type [24]. "With fixed length" means that the elements of the compressed data have the same number of bits.

To estimate the compression effects of null suppression with fixed length and null suppression with variable length, we introduce the ValueDomain array. The size of this array equals the batch size, and it records the number of bytes required to represent the valid bits of each element in the column. ValueDomain_MAX denotes the number of bytes used by elements after null suppression with fixed length. The compression ratio r for this method is given by (12).

r = Size_C / ValueDomain_MAX.    (12)

Similarly, its parameters are β = 0, r = r' = Size_C / ValueDomain_MAX.
• Pros: Its compression is very convenient and can be performed efficiently.
• Cons: It is not good at handling large outliers.

Null suppression with variable length (NSV): Null suppression with variable length is similar to null suppression with fixed length, achieving compression by removing leading zeros [24]. Unlike the fixed-length method, it does not mandate a consistent number of bits in the compressed output; instead, it records the byte length of each encoded value for decompression. In our design, with values ranging between 1 and 4 bytes, we use two bits to signify the byte count per value. Consequently, every group of four elements requires one extra byte to record their lengths.

Compared with null suppression with fixed length, null suppression with variable length has a better compression ratio in most cases and handles elements of different sizes well. However, when column elements predominantly fall within a narrow range, the additional bytes used to record their lengths become an overhead that is difficult to overlook.

Utilizing the ValueDomain array introduced for NS, the total number of bytes needed after compression can be derived by summing the values within ValueDomain. The compression ratio r for null suppression with variable length is determined by (13).

r = (Size_C · Size_B) / (Size_B / 4 + Σ_{i=1}^{Size_B} ValueDomain_i).    (13)

Because the compressed elements are not byte-aligned, they have to be decompressed before processing. Its parameters are β = 1, r' = 1.
• Pros: 1) It makes better use of space and achieves a higher compression ratio than NS. 2) It can handle large data variations and is not easily affected by outliers.
• Cons: 1) It needs decompression before processing. 2) It needs extra bytes to record the lengths.

C. Lazy Compression

Lazy compression algorithms wait until the entire input batch arrives and then compress the whole batch. Hence, their α value in the system cost model is 1.

Base-Delta encoding (BD): Base-Delta encoding is ideal for scenarios with large values and a limited data range, or when the differences between values are considerably smaller than the values themselves [25]. This method selects a base value from a series of values and stores it; each element is then represented by its delta from this base. If the delta is significantly smaller than the original element, it can be represented with fewer bytes.

We use BDDomain to represent the maximum number of bytes needed by Base-Delta encoding in a data column. The compression ratio r for Base-Delta encoding is calculated using (14).

r = Size_C / BDDomain.    (14)

Base-Delta encoding can avoid decompression in CompressStreamDB, so its parameters are β = 0, r = r' = Size_C / BDDomain.
• Pros: It achieves fast compression due to its reliance on basic vector addition and subtraction operations.
• Cons: It is only suitable for data with a small range of variation.
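A minimal sketch of the Base-Delta idea above. The column values and the 1-byte delta width are illustrative assumptions; the real encoder chooses BDDomain per column.

```python
def base_delta(column, delta_bytes=1):
    """Base-Delta encoding: store one base value plus small per-element
    deltas. Returns None when a delta does not fit in delta_bytes,
    reflecting that BD only suits narrow value ranges."""
    base = min(column)
    deltas = [v - base for v in column]
    if max(deltas) >= 1 << (8 * delta_bytes):
        return None                      # range too wide for BD
    return base, deltas

# Large values with a small spread: 4-byte elements shrink to 1-byte
# deltas, i.e. r = Size_C / BDDomain = 4 / 1 in (14).
assert base_delta([100000, 100003, 100001]) == (100000, [0, 3, 1])
assert base_delta([0, 500]) is None      # delta 500 does not fit in 1 B
```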

Run length encoding (RLE): Run length encoding is a classical compression algorithm [26] that is effective for datasets featuring recurring sequences of elements; it efficiently reduces space by compressing repetitive data that occurs periodically.

Suppose the average run length of a column of a data batch is AverageRunLength. As run-length encoding requires an additional integer variable (4 bytes) to represent the run length, its compression ratio r is defined in (15).

r = (Size_C · AverageRunLength) / (Size_C + 4).    (15)

RLE does not maintain byte alignment and disrupts the original data structure, necessitating decompression before processing. Consequently, its parameters are β = 1, r' = 1.
• Pros: 1) Its compression speed is relatively fast. 2) It can achieve a high compression ratio when AverageRunLength is high.
• Cons: 1) It needs decompression before processing the data. 2) It only applies to continuously repeated data.

Dictionary (DICT): The dictionary compression algorithm is commonly used to convert larger data into smaller data by establishing a one-to-one mapping [27]. If the number of distinct values is denoted as Kindnum, the compression ratio r of dictionary encoding is defined in (16).

r = Size_C / ⌈⌈log2 Kindnum⌉ / 8⌉.    (16)

Dictionary encoding is byte-aligned and structured, so it can avoid decompression. Accordingly, its parameters are β = 0, r = r' = Size_C / ⌈⌈log2 Kindnum⌉ / 8⌉.
• Pros: It has a relatively high compression ratio.
• Cons: It is appropriate only when there are few distinct values.

Bitmap: The bitmap compression algorithm is relatively concise, using a bit string to represent the original data [13], [28], [29], [30]; each bit in the bit string corresponds to a unique element in the original data. If the number of distinct values is denoted as Kindnum, the compression ratio r of bitmap encoding is defined in (17).

r = Size_C / (2^{⌈log2 Kindnum⌉} / 8).    (17)

It destroys the data structure of the original data, so β = 1, r' = 1.
• Pros: It has fast compression and decompression speed.
• Cons: It is appropriate only when there are few distinct values.

Position list word aligned hybrid (PLWAH): To demonstrate the flexibility of CompressStreamDB in compression algorithms, besides the original compression algorithms, we further extend a highly efficient variant of the compressed bitmap, PLWAH, into our system. The evaluation, detailed in Section VII, indicates that PLWAH further improves the performance of our system.

PLWAH is an efficient compressed bitmap data structure with a high compression ratio and fast operation capability [28]. When a sequence of elements filled with 1s or 0s appears, the PLWAH algorithm compresses them into a single element, thereby achieving a higher compression ratio. Moreover, the first different value after this sequence will also be compressed into the same element. In other words, PLWAH can merge all elements filled with 0 or 1.

Example: We use a simplified 8-bit example to illustrate the compression scheme of PLWAH. Assume that we have four bitmap entries to compress and the original data is [00000000, 00000000, 00000000, 00100000]. The compressed data is then a single 8-bit entry [1 0 011 011]. The first 1 indicates that this word is a compressed fill word. The second 0 indicates that the fill word is filled with 0. The next three digits "011" indicate a literal word with a "1" in the third digit (00100000). The last three digits "011" indicate three consecutive 0 fill words. In this way, the original four items are compressed into one item. In the 32-bit representation of PLWAH, the first 2 bits signify the word type, the subsequent 5 bits denote the position of the "1" in the next literal word, and the final 25 bits indicate the number of merged fill words [28]. Note that our bitmap approach is designed to compress data with specific types of values: the compressed bitmap can have at most one "1", with the remaining bits set to "0" [13]. PLWAH can be applied in such scenarios.

In practice, the most frequently occurring element can be mapped to the element filled with 0 in the bitmap, and the second most frequently occurring element can be mapped to the element filled with 1. This strategic mapping significantly reduces the space allocated to these two elements in the entire column. If we denote the count of the most frequent element as MostCount and the count of the second most frequent element as SecondCount, the compression ratio r of PLWAH is formally defined in (18).

r = Size_B / (Size_B − MostCount − SecondCount).    (18)

It destroys the data structure of the original data, so β = 1, r' = 1.
• Pros: It has a much higher compression ratio than bitmap.
• Cons: It is appropriate only when there are few distinct values.

VI. IMPLEMENTATION

We implement CompressStreamDB with reference to [14], [15], [65]. It comprises two primary modules: a client module housing the stream compression algorithms and the adaptive selector, and a server module equipped with fundamental SQL operators, including selection, projection, group-by, aggregation, and join, to process compressed streams. The server module takes charge of handling these compressed streams efficiently. Additionally, we integrate a profiler into the server, facilitating the collection of key performance metrics, including (de)compression and transmission times.
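The per-stage measurements mentioned above can be sketched with a tiny profiler. The stage names and the toy compression step are illustrative assumptions, not the actual CompressStreamDB profiler interface.

```python
import time
from collections import defaultdict

class StageProfiler:
    """Accumulates wall-clock time per processing stage, mirroring the
    kind of metrics the server-side profiler collects (compression,
    transmission, decompression, query)."""
    def __init__(self):
        self.totals = defaultdict(float)

    def record(self, stage, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.totals[stage] += time.perf_counter() - start
        return result

prof = StageProfiler()
batch = list(range(10000))
# Toy "compression" stage: a Base-Delta-style subtraction of the base.
deltas = prof.record("compress", lambda b: [v - b[0] for v in b], batch)
assert deltas[:3] == [0, 1, 2]
assert "compress" in prof.totals
```

Feeding such per-stage totals back into the cost model of Section IV is one plausible way to calibrate its instruction-count parameters; the paper does not specify this mechanism.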

processing without waiting for the entire data batch. This hy- 20.04.3 LTS with Java 8. Our server is deployed on both a cloud
brid processing mode significantly extends the applicability of platform and an edge platform. Their information is as follows.
system across a broad spectrum of stream processing scenarios. r Cloud platform. It has the same configuration as the client.
In our batch implementation, a sliding window can extend The price of the CPU is about $899. Its TDP is 205 W,
across multiple batches. To address this challenge, our system which indicates that it cannot be used at the edge because
incorporates a batch buffer, temporarily storing data from the of high power consumption. We mainly conduct system
previous batch. Upon detecting a sliding window that spans performance experiments on this platform, and use the
across batches, the system awaits the arrival of the subsequent turbostat tool to monitor its power.
batch. At this point, it retrieves the previously cached batch from r Edge platform. Our edge device is Raspberry Pi 4 Model
the buffer, facilitating the computation of results across sliding B [68] . It is equipped with Quad core ARM Cortex-A72
windows that cross batch boundaries. 64-bit SoC and 8 GB memory, running Raspberry Pi OS
CompressStreamDB can be deployed among different envi- 5.10 with Java 8. Its price is $75, with maximum 6.4 W
ronments. For example, Apache Storm can customize serializer. power consumption [69] . We use the edge device to prove
We can wrap the compression module of CompressStreamDB the portability of our system, and show the advantages
into a custom serializer, and then embed it into Storm for use. of using edge device for stream processing. To detect
However, this incurs additional challenges such as model inte- the Raspberry Pi’s power consumption, a power meter is
gration and Storm internal implementation overhead, especially attached.
in distributed environments. Because our work focuses on the Datasets: Our evaluation incorporates four datasets widely
adaptive selection of compressions in stream processing, we used in previous studies [14], [15], [61], [70], [71], [72], [73],
leave the adaptation to other systems as future work. [74], [75], all of which remain relevant in current discussions.
For instance, the smart home market is projected to reach 51.23
VII. EVALUATION billion by 2026, growing at an annual rate of 11.7% [55],
emphasizing the continued significance of these datasets. The
A. Experimental Setup first dataset originates from energy consumption measurements
Methodology: The baseline for comparison is Com- in smart grids [54], capturing data from various devices within
pressStreamDB without compression. Our system offers the a smart grid to enable load predictions and real-time demand
ability to disable the compression function, allowing uncom- management in energy consumption. The second dataset, com-
The second dataset, Google compute cluster monitoring [57], is derived from a Google cluster, simulating a cluster management scenario. The third dataset, the linear road benchmark [56], records vehicle position events and models a network of toll roads. The fourth dataset is the Star Schema Benchmark (SSB) [76]. It contains one fact table, four dimension tables, and thirteen standard queries. We adjust SSB for stream processing. The adaptation of more benchmarks is shown in Section VII-E.

Queries: We utilize eight queries to evaluate the performance of adaptive compression in CompressStreamDB. For each dataset, we execute two queries to derive performance metrics, evaluating various processing methods including the baseline, nine lightweight compression algorithms, and CompressStreamDB. These queries are well-established in prior stream processing studies [14], [15], [72], [73], [74], [75]. The specific details of the eight queries are outlined in Table III. Q1 and Q2 analyze the anomaly detection in smart grids dataset. Q3 and Q4 operate on the linear road benchmark dataset. Q5 and Q6 interact with the Google compute cluster monitoring data. Q7 and Q8 tackle the Star Schema Benchmark; we rewrite Q1.1 and Q1.2 of SSB to adapt them to stream processing.

In the Smart Grid and Linear Road Benchmark datasets, a batch encompasses 100 windows, and each window contains 1024 tuples. In the case of the Cluster Monitoring dataset, a batch comprises 200 windows, with each window consisting of 512 tuples. Finally, within the Star Schema Benchmark dataset, each batch contains 100 windows, and each window encompasses 512 tuples. The performance result for each dataset is the average of the results of the related queries.

Baselines: Our baseline performs SQL queries on original uncompressed stream processing. The comparison against this baseline aims to evaluate whether our solution enhances performance in stream systems. To better demonstrate the benefits of our adaptive compression approach, we conducted a performance comparison between our implementation and two high-performance stream processing systems, Saber [14] and FineStream [15]. Notably, both Saber and FineStream operate without employing compression techniques. We implement Base-Delta encoding compression with reference to TerseCades [66], denoted as "Base-Delta". TerseCades stands as a pioneering exploration of stream processing with data compression, demonstrating its efficacy in this domain. Our work showcases progressive advancements in performance compared to the Base-Delta encoding employed in TerseCades. As TerseCades is not open-source, we re-implement its functionality. Comparing the results of our implementation with those presented in the TerseCades paper [66], we observe similar outcomes. For instance, in [66], the system processed queries on the Pingmesh data [67], achieving a throughput of 37.5 MElems/s. In our implementation using Base-Delta encoding, we accomplish a throughput of 37.2 MElems/s when querying the Smart Grid data [54], showing a comparable performance level. Our study extends further by comparing the adaptive compression stream processing capability of CompressStreamDB across nine lightweight compression algorithms. To exhibit the portability of CompressStreamDB, we conduct experiments on both cloud and edge platforms, analyzing and comparing their throughput, power efficiency, and cost efficiency.

Platforms: Our client is equipped with an Intel Xeon Platinum 8269CY 2.5 GHz CPU and 16 GB memory, running Ubuntu.
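The Base-Delta baseline mentioned above stores each value as a small delta from a per-block base value. The following is a minimal sketch of the idea, our own illustration rather than TerseCades' actual implementation; the function names are ours.

```python
def base_delta_encode(values):
    """Store the first value as a base and the rest as deltas.

    Illustrative sketch of Base-Delta encoding: it pays off when
    neighboring stream values are numerically close to each other.
    """
    if not values:
        return None, []
    base = values[0]
    deltas = [v - base for v in values[1:]]
    return base, deltas


def base_delta_decode(base, deltas):
    """Reconstruct the original values from the base and deltas."""
    if base is None:
        return []
    return [base] + [base + d for d in deltas]
```

When values cluster near the base, each delta fits in far fewer bits than the original value, which is where the space saving comes from.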

Authorized licensed use limited to: Wenzhou University. Downloaded on December 02,2024 at 05:57:16 UTC from IEEE Xplore. Restrictions apply.
4542 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 9, SEPTEMBER 2024

TABLE III
QUERIES USED IN EVALUATION

B. Performance Comparison

Throughput: We delve into the throughput analysis of CompressStreamDB across the eight queries performed on the four datasets. The results are shown in Fig. 4, showcasing the performance of each dataset with distinct processing methods. Our baseline achieves similar throughput in comparison to Saber and FineStream, showcasing a performance similarity between our baseline and other state-of-the-art systems. On average, CompressStreamDB demonstrates a remarkable 3.84× throughput improvement over our baseline. Note that due to the presence of negative numbers in the linear road benchmark dataset, EG and ED cannot be applied to it. Here are our key observations. First, on the Smart Grid dataset, CompressStreamDB showcases a substantial 6.76× increase in throughput compared to the baseline. Among the individual compression algorithms, DICT encoding stands out, delivering a 3.00× throughput improvement over the baseline. However, CompressStreamDB outperforms DICT encoding, achieving a notable 2.25× improvement in throughput. Second, concerning the linear road benchmark dataset, CompressStreamDB demonstrates a significant 2.83× increase in throughput over the baseline, while NS achieves a 2.28× improvement over the baseline. Notably, CompressStreamDB outperforms NS by 24.1% in system performance. Third, in the context of the Google Cluster Monitoring dataset, CompressStreamDB demonstrates a remarkable 3.85× increase in throughput, while BD achieves a 2.36× improvement. CompressStreamDB achieves a substantial 63.1% throughput improvement over BD. Finally, concerning the Star Schema Benchmark dataset, CompressStreamDB achieves a commendable 1.92× improvement in throughput, compared to BD's 1.68× enhancement. CompressStreamDB outperforms BD by 14.3% in system performance.

Fig. 4. Throughput of different compression methods.

Based on the observed throughput outcomes, several significant insights emerge. First, compression can obviously improve the throughput of the stream processing system, which has been demonstrated in [66]. While BD consistently showcases robust system performance across various scenarios, the adaptive compression within CompressStreamDB consistently outperforms it. Second, the impact of compression in stream processing is notably contingent upon dataset properties. For instance, when the AverageRunLength of a dataset is low, RLE fails to deliver substantial performance improvements. Thus, the algorithm selection process emerges as a pivotal factor in enhancing system performance. Third, CompressStreamDB adeptly amalgamates the strengths of diverse algorithms, showcasing remarkable adaptability across varying datasets. Its capability to consistently match or surpass the performance of individual compression algorithms underscores its versatility across datasets of varying complexities.
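The data-aware insight above — RLE only pays off when runs of equal values are long — can be sketched as a simple selector. The threshold and the fallback choice below are our own illustrative assumptions, not CompressStreamDB's actual decision rule.

```python
def average_run_length(values):
    """Mean length of runs of equal consecutive values."""
    if not values:
        return 0.0
    runs = 1
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            runs += 1
    return len(values) / runs


def pick_algorithm(values, run_length_threshold=4.0):
    """Prefer RLE only when runs are long enough to amortize its
    per-run overhead; otherwise fall back to a generic scheme
    (here: dictionary encoding)."""
    if average_run_length(values) >= run_length_threshold:
        return "RLE"
    return "DICT"
```

A real system would fold in further data characteristics (value range, distinct-value count, sortedness) and the network budget, but the shape of the decision is the same.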


Latency: Fig. 5 reports the latency of different compression algorithms on the four datasets. In our work, latency represents the time from data input to the query result output. Similar to throughput, latency is an important target of system performance. The latency of our baseline is similar to that of Saber and FineStream, demonstrating performance on par with state-of-the-art systems. On average, CompressStreamDB achieves 68.0% lower latency. Moreover, we have the following observations. First, on the Smart Grid dataset, CompressStreamDB achieves an 85.2% reduction compared to the uncompressed system and surpasses the performance of DICT by 55.6% in latency reduction. Second, on the linear road benchmark dataset, CompressStreamDB delivers a 64.6% decrease in latency compared to the baseline and outperforms NS by 19.4%. Third, on the Cluster Monitoring dataset, CompressStreamDB exhibits a 74.0% reduction in latency against the baseline and surpasses BD by 38.7%. Finally, on the Star Schema Benchmark dataset, CompressStreamDB demonstrates a 48.0% decrease in latency compared to the baseline and outshines BD by 12.7%. In summary, CompressStreamDB consistently delivers superior latency performance compared to individual compression algorithms.

Fig. 5. Latency of different compression methods.

Dynamic workload: To illustrate the comparison between the static processing solution and the dynamic processing in CompressStreamDB, we use the datasets and benchmarks to generate dynamic workloads and evaluate on them. Using Q1 and Q2 on Smart Grids as an illustrative example, we present speedup comparisons across various network bandwidths in Fig. 6. This trend remains consistent across other cases, showcasing similar performance behaviors. We denote by "Static" the static compressed processing method with the optimal performance on the dynamic workload, while "CompressStreamDB" applies our dynamic design. The experimental results demonstrate that CompressStreamDB outperforms the static solution across varying network conditions. Particularly, in a network with 100 Mbps bandwidth, CompressStreamDB exhibits remarkable performance enhancement, achieving a 9.68× speedup over the baseline and 3.97× over the optimal static method. The performance of the static method remains suboptimal with dynamic workloads due to its rigid data handling approach. In contrast, CompressStreamDB exhibits consistent performance as its adaptive processing method dynamically adjusts based on data characteristics, ensuring stable performance even amidst changes.

Fig. 6. Speedup with dynamic workload.

C. Analysis of Time and Space Savings

(De)compression time: As we mentioned in Section IV-C, CompressStreamDB targets diverse lightweight fast compression methods. For the compressions that can bring significant performance benefits, even if decompression is required, we still involve them in CompressStreamDB. In our experiments, we include four lightweight decompression-required methods: RLE, bitmap, NSV, and PLWAH. We conduct experiments on three datasets to measure the compression time and decompression time during processing. Results are shown in Fig. 7. Our observations are as follows. First, NS is a simple and fast compression technique, exhibiting the shortest total compression and decompression time across all four datasets. Notably, NS consistently provides commendable compression ratios across diverse scenarios, ensuring high throughput and latency performance. In contrast, EG, ED, and PLWAH, while categorized as lightweight algorithms, demonstrate relatively slower performance. Their computational intricacies and coding processes contribute to this disparity, potentially impacting their overall efficiency. Second, NSV primarily invests more time in decompression due to the translation process of byte lengths, which influences its behavior. However, it is crucial to note that transmission time constitutes the predominant portion of the total processing time. Specifically, in the case of lightweight compressions, including NSV, decompression time represents less than 1.0% of the total processing time, rendering it negligible. Third, CompressStreamDB does not rank as the most time-efficient method in compression and decompression; it takes a moderate amount of time. However, CompressStreamDB prioritizes the overall system performance rather than focusing solely on optimizing compression efficiency.

Fig. 7. Time breakdown of compression and decompression. CmpStr is short for CompressStreamDB.
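Of the decompression-required methods discussed above, RLE is the simplest to illustrate. The sketch below is our own illustration, not the system's implementation:

```python
def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([v, 1])     # start a new run
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

Encoding and decoding are both single linear passes, which is why the time they add is negligible next to transmission time; the space saving, however, exists only when runs are long.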


TABLE IV
RELATIONS BETWEEN TIME AND COMPRESSION

Relation between time and compression ratio: Table IV shows the relation between time and compression ratio r. On average, CompressStreamDB saves 68.7% space and 66.1% transmission time. We have the following observations. First, among all processing methods across datasets, CompressStreamDB demonstrates the lowest trans_time ratio. Although BD, as utilized in TerseCades, exhibits a notable advantage of 60.1% on average, CompressStreamDB still manages to surpass BD by 21.6%. Second, CompressStreamDB achieves the highest compression ratio r, or the lowest 1/r, across all processing methods for each dataset. For instance, on the Smart Grid dataset, CompressStreamDB achieves an r of 6.49, surpassing BD's 2.80. This performance marks a 2.32× higher compression ratio r compared to the optimal single compression algorithm. Third, the transmission time ratio and 1/r are positively correlated and change proportionally, as does the ratio of query time to 1/r. According to (4) and (5), a high compression ratio r directly implies lower transmission time. Given our use of lightweight compression methods, the time incurred during compression and decompression remains marginal. Hence, the method with the highest compression ratio can significantly enhance system performance. CompressStreamDB attains its high performance predominantly through its exceptional compression ratio r.

Executed instructions versus compression time: In our cost model, compression time and decompression time relate to the number of executed instructions. To validate this assumption, we conduct an exploration using BD, Bitmap, and DICT on the Smart Grid dataset. Fig. 8 illustrates the relationship between the number of executed instructions and the time spent in compression. Note that waiting time is not factored into this analysis. According to (2), compression time is expected to be proportional to the number of executed instructions in a compute-intensive situation. The line in Fig. 8 represents the number of instructions used in compression for processing each tuple. Meanwhile, the bar denotes the compression time without considering the waiting time. Fig. 8 reveals a nearly proportional relationship between the number of instructions executed and the compression time. This aligns closely with the estimations outlined in our cost model.

Fig. 8. Relation between compression time and instructions.

D. Comparison Between Edge Device and Cloud Device

In this section, we first compare the throughput on the cloud and edge platforms, and then compare their throughput/price ratio and throughput/power ratio, which demonstrate the cost and energy benefits of the edge device.

Throughput: To demonstrate the portability of CompressStreamDB and further explore its performance, we conduct comparative experiments on the edge platform. We use a Raspberry Pi 4B as the edge platform for comparison. Fig. 9 illustrates the throughput comparison between the cloud and the edge across four datasets. On average, the cloud exhibits 1.21× the throughput of the edge across these datasets. Consequently, the performance of the cloud surpasses that of the edge, resulting in superior speedup on the cloud platform.

Fig. 9. Comparison of throughput on cloud and edge device.

Despite lacking performance advantages compared to cloud platforms, edge devices offer distinct benefits in terms of lower power consumption, reduced costs, and their aptness as localized servers closer to end-users. Therefore, evaluating the efficacy of edge devices should encompass considerations of price and power consumption. To assess this, we introduce two critical ratios: the throughput/price ratio, calculated as the ratio of throughput to price (USD), and the throughput/power ratio, denoting the ratio of throughput to power (Watt).
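The two ratios just defined are straightforward to compute. The sketch below uses the 33.5 W and 3.8 W average power figures reported in this section, while the absolute throughput and the prices are made-up placeholder values of our own:

```python
def throughput_price_ratio(throughput_elems_s, price_usd):
    """Throughput per dollar of hardware cost."""
    return throughput_elems_s / price_usd


def throughput_power_ratio(throughput_elems_s, power_watts):
    """Throughput per watt of average power draw."""
    return throughput_elems_s / power_watts


# Cloud at 1.21x the edge's throughput (as measured above); the
# edge throughput value itself is a placeholder.
edge_tp = 10e6                 # elements/s, placeholder
cloud_tp = 1.21 * edge_tp

edge_advantage = (throughput_power_ratio(edge_tp, 3.8)
                  / throughput_power_ratio(cloud_tp, 33.5))
# edge_advantage is about 7.3x, in line with the reported 7.32x average
```

Note that the advantage cancels the placeholder throughput out: it depends only on the 1.21× throughput gap and the 33.5 W / 3.8 W power gap, which is why the arithmetic lands close to the averaged figure reported below.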


Throughput/price ratio: Fig. 10 shows the throughput/price ratios on the cloud server and the edge device. In the Smart Grid dataset, the throughput/price ratio of the edge CPU surpasses that of the cloud CPU by 10.81×, and for the remaining three datasets, the ratios stand at 8.90×, 10.91×, and 9.18×, respectively. On average, the edge device exhibits a 9.95× higher throughput/price ratio compared to the cloud, showcasing a pronounced cost advantage of utilizing edge devices. These findings affirm the cost-effectiveness inherent in the design purpose of the Raspberry Pi, known for its economical utility.

Fig. 10. Comparison of throughput/price ratio on cloud and edge device.

Throughput/power ratio: In Fig. 11, the throughput/power ratio for both cloud and edge platforms is presented. Notably, the average power consumption of the cloud CPU stands at 33.5 W, whereas the edge device operates at an average power of 3.8 W. In the Smart Grid dataset, the throughput/power ratio on the edge surpasses that of the cloud by 7.95×, while for the other three datasets, these ratios are 6.55×, 8.03×, and 6.75×, respectively. On average, the edge platform demonstrates a 7.32× higher throughput/power ratio compared to the cloud. These findings underscore the potential for significant energy savings by leveraging edge devices compared to cloud-based operations.

Fig. 11. Comparison of throughput/power ratio on cloud device and edge device.

E. Design Tradeoffs and Discussion

Model accuracy: We verify the accuracy of our system cost model in this part. We use the example of the Smart Grid dataset for illustration, as shown in Fig. 12. The dashed line and the solid line show the estimated time and the measured time, respectively. All estimated values are slightly less than the actual values because of additional overhead caused by the system operation. On average, the system cost model within CompressStreamDB achieves an accuracy of 89.2%. This level of accuracy suggests that the cost model is reliable and suitable for estimating the cost associated with stream processing integrated with compression techniques.

Fig. 12. Accuracy of the cost model. CmpStr is short for CompressStreamDB.

Batch size: As outlined in Section IV-A, the batch size is a factor influencing both latency and compression ratio. We use the Smart Grid workload as an illustration, where each window contains 1,024 tuples. We depict their interrelation in Fig. 13, considering three distinct network settings: a 1 Gbps network, a 100 Mbps network, and a single node without network transmission. Our observations are as follows. First, at 100 Mbps, we observe a notable rise in latency corresponding to larger batch sizes. Conversely, within the 1 Gbps network and single-node mode, batch size exhibits comparatively minimal impact on system latency. This divergence arises from the constraints imposed by limited network bandwidth, leading to data queuing before transmission. Consequently, larger batch sizes can induce system pauses. Second, as batch size increases, the space occupancy decreases. This phenomenon is attributed to the improved utilization of data redundancy with larger batch sizes. Third, the absence of an optimal batch size suggests that its determination requires specific situational analysis. Furthermore, we conducted measurements on cross-batch sliding windows, varying the window slide size within the range {1, 128, 256, 512, 1024} across different network settings. We observed nearly identical performance levels with minimal fluctuation (less than 2%). This consistency is attributed to the ability of our batch buffer to retain critical data without imposing additional strain on network transmission.

Fig. 13. Effect of batch size on latency and space usage.

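The bandwidth-dependent batch-size effect described above can be approximated with a rough per-batch model. This is our own illustration, not the paper's cost model; the tuple size, arrival rate, and compression ratio below are made-up placeholders, not measured values.

```python
def batch_times(batch_tuples, tuple_bytes, arrival_rate_tps, bandwidth_bps, r):
    """Return (fill_time_s, send_time_s) for one batch: the time to
    accumulate batch_tuples at the given arrival rate, plus the time
    to transmit them compressed at ratio r over the given link."""
    fill_s = batch_tuples / arrival_rate_tps
    send_s = (batch_tuples * tuple_bytes * 8) / (bandwidth_bps * r)
    return fill_s, send_s


# 100 windows x 1024 tuples, as in the Smart Grid configuration;
# 32-byte tuples, 1M tuples/s arrival, and r = 2 are assumptions.
fill, send_100mbps = batch_times(100 * 1024, 32, 1_000_000, 100e6, 2.0)
_, send_1gbps = batch_times(100 * 1024, 32, 1_000_000, 1e9, 2.0)
```

At 100 Mbps the send time is ten times that at 1 Gbps, so larger batches queue behind the link and latency climbs with batch size, while on the faster link (or a single node) the fill time dominates and batch size matters little — the behavior observed in Fig. 13.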

Performance on additional benchmark: CompressStreamDB can adapt to various benchmarks. We evaluate it on the TPC-H benchmark to further demonstrate its potential. The TPC-H benchmark contains one fact table and seven dimension tables, with twenty-two standard queries. We adjust TPC-H and rewrite its queries for stream processing. We use Q1, Q2, and Q3 of its standard queries for evaluation. On average, CompressStreamDB achieves 2.18× throughput compared to our baseline, while the best single compression algorithm, BD, achieves 1.61× throughput. In terms of latency, CompressStreamDB reduces latency by 53.2% compared to the baseline, and by 26.2% compared to BD.

Impact of parallelism: To investigate the impact of parallelism on performance, we conduct parallel throughput experiments on the cloud platform. On average, the parallel version achieves a throughput improvement of 2.10% compared to the single-core version. Although our cloud platform is equipped with an Intel Xeon Platinum 8269CY CPU and supports 52 threads, the performance improvement of parallelism is limited because network transmission is a key factor affecting performance in our experimental network environment.

VIII. RELATED WORK

Data stream optimization: Due to the increasing demand for real-time processing of vast amounts of data, many studies have been devoted to optimizing the performance of stream systems [3], [15], [65], [77], [78], [79], [80]. To name a few, Koliousis et al. [14] developed Saber, a window-based hybrid stream processing engine for discrete CPU-GPU architectures. Zhang et al. [65] introduced BriskStream, an in-memory data stream processing system on shared-memory multi-core NUMA architectures. Zhang et al. [15] proposed a fine-grained query method of stream processing on CPU-GPU integrated architectures. Zhang et al. [81] revisited the design of data stream processing systems on multi-core processors. Scabbard [3] is a recently proposed single-node optimized stream processing engine focusing on the fault-tolerance aspect. Li et al. [79] proposed a framework called TRACE that allows compression on traffic monitoring streams. Pekhimenko et al. [66] proposed TerseCades, adopting an integer compression method and a floating-point number compression method to enable direct processing on compressed data. However, none of them utilize the diversity of lightweight data compression technology in stream processing or take the multi-layer transmission scenario with its complex factors into consideration.

Processing on compressed data: CompressStreamDB's direct SQL query processing on compressed data is a main feature that significantly reduces both time and space overhead in stream processing. Data compression [13], [28], [29], [30], [64], [82], [83], [84], [85], [86], [87], [88], [89], [90] has been proved to be an effective approach to increase bandwidth utilization and resolve memory stalls. Wang et al. [30] developed inverted list compression in memory. Deliege and Pedersen [28] optimized space and performance for bitmap compression. Wang et al. [29] conducted an experimental study between bitmap and inverted list compressions. Fang et al. [13] analyzed common compression algorithms, while Przymus and Kaczmarski [64] explored how to select an optimal compression method for time series databases. Sprintz [91] introduces a four-part composite compression algorithm for time-series data. Many works [86], [87] used hardware such as GPUs and FPGAs to optimize data compression. As for processing directly on compressed data [18], [19], [20], [21], [92], [93], [94], this technology can provide efficient storage and retrieval of data. For example, Chen et al. [94] proposed a memory-efficient optimization approach for large graph analytics, which compresses the intermediate vertex information with Huffman coding or bitmap coding and queries on the partially decoded data or directly on the compressed data. Li et al. [95], [96], [97] presented compression methods for very large databases, with aggregation operating directly on compressed datasets. Succinct [20] enables efficient queries directly on a compressed representation of data. Other works [18], [19], [98], [99] focused on the direct processing of other compressed storage structures such as graphs. Different from these studies, our work is the first fine-grained stream processing engine that can query compressed streams without decompression.

IX. CONCLUSION

The demands on stream processing systems continue to surge as the scale of streaming data expands. The growing volume presents considerable challenges in terms of both time efficiency and resource utilization within these systems. We propose CompressStreamDB, which applies compression algorithms in stream processing to improve the system performance. In our implementation, CompressStreamDB integrates nine lightweight compression algorithms, significantly enhancing performance compared to operating without any compression. Our experiments demonstrate that across four real-world datasets, CompressStreamDB achieves a substantial 3.84× increase in throughput while attaining a 68.0% reduction in latency and saving 68.7% space. Moreover, the throughput/price ratio on the edge platform outperforms that of the cloud platform by 9.95×, while the throughput/power ratio on the edge is 7.32× higher than that of the cloud.

REFERENCES

[1] B. Del Monte et al., "Rhino: Efficient management of very large distributed state for stream processing engines," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 2471–2486.
[2] G. Van Dongen and D. Van den Poel, "Evaluation of stream processing frameworks," IEEE Trans. Parallel Distrib. Syst., vol. 31, no. 8, pp. 1845–1858, Aug. 2020.
[3] G. Theodorakis et al., "Scabbard: Single-node fault-tolerant stream processing," Proc. VLDB Endowment, vol. 15, pp. 361–374, 2021.
[4] A. R. M. Forkan et al., "AIoT-citysense: AI and IoT-driven city-scale sensing for roadside infrastructure maintenance," Data Sci. Eng., pp. 1–15, 2023.
[5] State of IoT 2021, 2021. [Online]. Available: https://iot-analytics.com/number-connected-iot-devices/


[6] W. Wingerath et al., "Real-time stream processing for Big Data," Inf. Technol., vol. 58, pp. 186–194, 2016.
[7] B. Gedik et al., "SPADE: The system S declarative stream processing engine," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 1123–1134.
[8] M. Hirzel et al., "Stream processing languages in the Big Data era," ACM SIGMOD Rec., vol. 47, pp. 29–40, 2018.
[9] P. Deutsch et al., "GZIP file format specification version 4.3," 1996.
[10] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337–343, May 1977.
[11] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inf. Theory, vol. IT-21, no. 2, pp. 194–203, Mar. 1975.
[12] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530–536, Sep. 1978.
[13] W. Fang, B. He, and Q. Luo, "Database compression on graphics processors," Proc. VLDB Endowment, vol. 3, pp. 670–680, 2010.
[14] A. Koliousis et al., "SABER: Window-based hybrid stream processing for heterogeneous architectures," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2016, pp. 555–569.
[15] F. Zhang et al., "FineStream: Fine-grained window-based stream processing on CPU-GPU integrated architectures," in Proc. USENIX Annu. Tech. Conf., 2020, Art. no. 43.
[16] A third of the internet is just a copy of itself, 2013. [Online]. Available: https://www.businessinsider.com/
[17] P. R. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, and M. Seltzer, "Network-aware operator placement for stream-processing systems," in Proc. 22nd Int. Conf. Data Eng., 2006, pp. 49–49.
[18] F. Zhang, J. Zhai, X. Shen, O. Mutlu, and X. Du, "Enabling efficient random access to hierarchically-compressed data," in Proc. IEEE 36th Int. Conf. Data Eng., 2020, pp. 1069–1080.
[19] F. Zhang et al., "Efficient document analytics on compressed data: Method, challenges, algorithms, insights," Proc. VLDB Endowment, vol. 11, pp. 1522–1535, 2018.
[20] R. Agarwal, A. Khandelwal, and I. Stoica, "Succinct: Enabling queries on compressed data," in Proc. 12th USENIX Conf. Netw. Syst. Des. Implementation, 2015, pp. 337–350.
[21] A. Khandelwal, "Queries on compressed data," University of California, Berkeley, 2019.
[22] F. Zhang et al., "CompressDB: Enabling efficient compressed data direct processing for various databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2022, pp. 1655–1669.
[23] Y. Zhang, F. Zhang, H. Li, S. Zhang, and X. Du, "CompressStreamDB: Fine-grained adaptive stream processing without decompression," in Proc. IEEE 39th Int. Conf. Data Eng., 2023, pp. 408–422.
[24] P. A. Alsberg, "Space and time savings through large data base compression and dynamic restructuring," Proc. IEEE, vol. 63, no. 8, pp. 1114–1122, Aug. 1975.
[25] G. Pekhimenko, V. Seshadri, O. Mutlu, M. A. Kozuch, P. B. Gibbons, and T. C. Mowry, "Base-delta-immediate compression: Practical data compression for on-chip caches," in Proc. 21st Int. Conf. Parallel Architectures Compilation Techn., 2012, pp. 377–388.
[26] D. Abadi, S. Madden, and M. Ferreira, "Integrating compression and execution in column-oriented database systems," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 671–682.
[27] M. A. Roth and S. J. Van Horn, "Database compression," ACM SIGMOD Rec., vol. 22, pp. 31–39, 1993.
[28] F. Deliège and T. B. Pedersen, "Position list word aligned hybrid: Optimizing space and performance for compressed bitmaps," in Proc. 13th Int. Conf. Extending Database Technol., 2010, pp. 228–239.
[29] J. Wang et al., "An experimental study of bitmap compression vs. inverted list compression," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2017, pp. 993–1008.
[30] J. Wang et al., "MILC: Inverted list compression in memory," Proc. VLDB Endowment, vol. 10, pp. 853–864, 2017.
[31] R. Stephens, "A survey of stream processing," Acta Inform., vol. 34, pp. 491–541, 1997.
[32] P. Córdova, "Analysis of real time stream processing systems considering latency," University of Toronto, 2015.
[33] M. H. Ali et al., "Microsoft CEP server and online behavioral targeting," Proc. VLDB Endowment, vol. 2, pp. 1558–1561, 2009.
[34] Apache storm, 2021. [Online]. Available: http://storm.apache.org
[35] Apache flink, 2021. [Online]. Available: http://flink.apache.org
[36] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IEEE, vol. 40, no. 9, pp. 1098–1101, Sep. 1952.
[37] S. Zhang et al., "Parallelizing intra-window join on multicores: An experimental study," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2021, pp. 2089–2101.
[38] H. Yar et al., "Towards smart home automation using IoT-enabled edge-computing paradigm," Sensors, vol. 21, p. 4932, 2021.
[39] Z. Lv et al., "Intelligent edge computing based on machine learning for smart city," Future Gener. Comput. Syst., vol. 115, pp. 90–99, 2021.
[40] E. Sisinni, A. Saifullah, S. Han, U. Jennehag, and M. Gidlund, "Industrial Internet of Things: Challenges, opportunities, and directions," IEEE Trans. Ind. Informat., vol. 14, no. 11, pp. 4724–4734, Nov. 2018.
[41] J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, and C. Chen, "Data-driven intelligent transportation systems: A survey," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1624–1639, Dec. 2011.
[42] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, Fourth Quarter 2017.
[43] J. Chen and X. Ran, "Deep learning with edge computing: A review," Proc. IEEE, vol. 107, no. 8, pp. 1655–1674, Aug. 2019.
[44] C.-H. Chen, M.-Y. Lin, and C.-C. Liu, "Edge computing gateway of the industrial Internet of Things using multiple collaborative microcontrollers," IEEE Netw., vol. 32, no. 1, pp. 24–32, Jan./Feb. 2018.
[45] B. Hussain, Q. Du, S. Zhang, A. Imran, and M. A. Imran, "Mobile edge computing-based data-driven deep learning framework for anomaly detection," IEEE Access, vol. 7, pp. 137656–137667, 2019.
[46] S. Rajesh, V. Paul, V. G. Menon, S. Jacob, and P. Vinod, "Secure brain-to-brain communication with edge computing for assisting post-stroke paralyzed patients," IEEE Internet Things J., vol. 7, no. 4, pp. 2531–2538, Apr. 2020.
[47] S. Aggarwal and S. Sharma, "Voice based deep learning enabled user interface design for smart home application system," in Proc. 2nd Int. Conf. Commun. Comput. Ind. 4.0, 2021, pp. 1–6.
[48] S. Liu, L. Liu, J. Tang, B. Yu, Y. Wang, and W. Shi, "Edge computing for autonomous driving: Opportunities and challenges," Proc. IEEE, vol. 107, no. 8, pp. 1697–1716, Aug. 2019.
[49] A. Zilberman and L. Ice, "Why computer occupations are behind strong stem employment growth in the 2019–29 decade," Computer, vol. 4, no. 5, pp. 11–5, 2021.
[50] D. Park et al., "LiReD: A light-weight real-time fault detection system for edge computing using LSTM recurrent neural networks," Sensors, vol. 18, p. 2110, 2018.
[51] X. Jiang, F. R. Yu, T. Song, and V. C. M. Leung, "A survey on multi-access edge computing applied to video streaming: Some research issues and challenges," IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 871–903, Second Quarter 2021.
[52] Z. Zhao et al., "IoT edge computing-enabled collaborative tracking system for manufacturing resources in industrial park," Adv. Eng. Inform., vol. 43, 2020, Art. no. 101044.
[53] P. Ranaweera, A. D. Jurcut, and M. Liyanage, "Survey on multi-access edge computing security and privacy," IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 1078–1124, Second Quarter 2021.
[54] H. Ziekow and Z. Jerzak, "The DEBS 2014 grand challenge," in Proc. 8th ACM Int. Conf. Distrib. Event-Based Syst., 2014, pp. 266–269.
[55] Smart home statistics, 2021. [Online]. Available: https://www.statista.com/outlook/dmo/smart-home/united-states
[56] A. Arasu et al., "Linear road: A stream data management benchmark," in Proc. 30th Int. Conf. Very Large Data Bases, 2004, pp. 480–491.
[57] More Google cluster data, 2011. [Online]. Available: https://ai.googleblog.com/2011/11/more-google-cluster-data.html
[58] V. Gulisano et al., "The DEBS 2017 grand challenge," in Proc. 11th ACM Int. Conf. Distrib. Event-Based Syst., 2017, pp. 271–273.
[59] V. Gulisano et al., "The DEBS 2018 grand challenge," in Proc. 12th ACM Int. Conf. Distrib. Event-Based Syst., 2018, pp. 191–194.
[60] C. Mutschler, H. Ziekow, and Z. Jerzak, "The DEBS 2013 grand challenge," in Proc. 7th ACM Int. Conf. Distrib. Event-Based Syst., 2013, pp. 289–294.
[61] A. Shanbhag, S. Madden, and X. Yu, "A study of the fundamental performance characteristics of GPUs and CPUs for database analytics," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2020, pp. 1617–1632.
[62] T. Neumann, "Efficiently compiling efficient query plans for modern hardware," Proc. VLDB Endowment, vol. 4, pp. 539–550, 2011.
[63] P. A. Boncz, M. Zukowski, and N. Nes, "MonetDB/X100: Hyper-pipelining query execution," in Proc. Conf. Innov. Data Syst. Res., 2005, pp. 225–237.

Yu Zhang received the bachelor's degree from the Department of Computer Science and Technology, Tsinghua University, in 2021. He is currently working toward the PhD degree with DEKE Lab and School of Information, Renmin University of China. His major research interests include database systems and parallel computing.
Feng Zhang received the bachelor's degree from Xidian University, in 2012, and the PhD degree in computer science from Tsinghua University, in 2017. He is a professor with DEKE Lab and School of Information, Renmin University of China. His major research interests include database systems, and parallel and distributed systems.
Hourun Li is a research assistant with the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China. He joined the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), in 2020. His major research interests include database systems, and parallel and distributed systems.
Shuhao Zhang received the bachelor's degree in computer engineering from Nanyang Technological University, in 2014, and the PhD degree in computer science from the National University of Singapore, in 2019. He is currently an assistant professor with Nanyang Technological University. His research interests include high performance computing, stream processing systems, and database systems.
Xiaoguang Guo is a research assistant with the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China. He joined the Key Laboratory of Data Engineering and Knowledge Engineering (MOE), in 2020. His major research interests include database systems and distributed systems.

Anqun Pan is the technical director with the Database R&D Department, Tencent, in China. With more than 15 years of experience, he has specialized in the research and development of distributed computing and storage systems. Currently, he is responsible for steering the research and development of the Tencent distributed database system (TDSQL).

Yuxing Chen received the PhD degree in computer science from the University of Helsinki, Finland, in 2021. He currently works as a senior research engineer with the Database R&D Department, Tencent, China. His research interests focus on database performance and evaluation, HTAP database design, and distributed system design.

Xiaoyong Du received the BS degree from Hangzhou University, Zhejiang, China, in 1983, the ME degree from the Renmin University of China, Beijing, China, in 1988, and the PhD degree from the Nagoya Institute of Technology, Nagoya, Japan, in 1997. He is currently a professor with the School of Information, Renmin University of China. His current research interests include databases and intelligent information retrieval.
