
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 31, NO. 8, AUGUST 2020

Evaluation of Stream Processing Frameworks


Giselle van Dongen, Member, IEEE, and Dirk Van den Poel, Senior Member, IEEE

The authors are with Ghent University, 9000 Ghent, Belgium. E-mail: {Giselle.vanDongen, Dirk.VandenPoel}@ugent.be. Manuscript received 16 Aug. 2019; revised 10 Jan. 2020; accepted 18 Feb. 2020. Date of publication 5 Mar. 2020; date of current version 30 Mar. 2020. (Corresponding author: Giselle van Dongen.) Recommended for acceptance by Henri E. Bal. Digital Object Identifier no. 10.1109/TPDS.2020.2978480.

Abstract—The increasing need for real-time insights in data sparked the development of multiple stream processing frameworks. Several
benchmarking studies were conducted in an effort to form guidelines for identifying the most appropriate framework for a use case. In this
article, we extend this research and present the results gathered. In addition to Spark Streaming and Flink, we also include the emerging
frameworks Structured Streaming and Kafka Streams. We define four workloads with custom parameter tuning. Each of these is optimized
for a certain metric or for measuring performance under specific scenarios such as bursty workloads. We analyze the relationship between
latency, throughput, and resource consumption and we measure the performance impact of adding different common operations to the
pipeline. To ensure correct latency measurements, we use a single Kafka broker. Our results show that the latency disadvantages of using
a micro-batch system are most apparent for stateless operations. With more complex pipelines, customized implementations can give
event-driven frameworks a large latency advantage. Due to its micro-batch architecture, Structured Streaming can handle very high
throughput at the cost of high latency. Under tight latency SLAs, Flink sustains the highest throughput. Additionally, Flink shows the least
performance degradation when confronted with periodic bursts of data. When a burst of data needs to be processed right after startup,
however, micro-batch systems catch up faster while event-driven systems output the first events sooner.

Index Terms—Apache spark, structured streaming, apache flink, apache kafka, kafka streams, distributed computing, stream processing
frameworks, benchmarking, big data

1 INTRODUCTION
In response to the increasing need for fast, reliable and accurate answers to data questions, many stream processing frameworks were developed. Each use case, however, has its own performance requirements and data characteristics. Several benchmarks were conducted in an effort to shed light on which frameworks perform best in which scenario, e.g., [1], [2], [3], [4], [5]. Previous work often benchmarked Flink and Spark Streaming. We extend this work by comparing later releases of these frameworks with two emerging frameworks: Structured Streaming, which has shown promising results [6], and Kafka Streams, which has been adopted by large IT companies such as Zalando and Pinterest. We take a two-pronged approach to comparing stream processing frameworks on different workloads and scenarios. In the first place, we assess the performance of different frameworks on similar operations. We look at relationships between metrics such as latency, throughput and resource utilization for different processing scenarios. The parameter configurations of a framework have a large influence on the performance in a certain processing scenario. Therefore, we tune parameters per workload. Additionally, we evaluate the performance impact of adding different, complex operations to the workflow of each framework. Understanding the impact of an operation on the performance of a pipeline is a valuable insight in the design phase of the processing flow. By combining these two approaches, we can form a more nuanced evaluation of which framework is most suitable for which use case. In short, we make the following contributions:

1) Open-source benchmark and extensive analysis of emerging frameworks, Structured Streaming and Kafka Streams, besides newer releases of Flink and Spark Streaming.
2) Accurate latency measurement of pipelines of different complexities for all frameworks by capturing time on a single Kafka broker.
3) Thorough analysis of relationships between latency, throughput and resource consumption for each of the frameworks and under different processing pipelines, workloads and throughput levels.
4) Dedicated parameter tuning per workload for more nuanced performance evaluations with detailed reasoning behind each of the settings.
5) Inclusion of built-in and customized stateful operators.
6) Guidelines for choosing the right tool for a use case by taking into account inter-framework differences, different pipeline complexities, implementation flexibility and data characteristics.

The code for this benchmark can be found at https://github.com/Klarrio/open-stream-processing-benchmark.

The rest of this paper is organized as follows. The next section gives an overview of the related work that formed a basis for this study. In Section 3 we describe the setup that was used to conduct this benchmark. Section 4 dives deeper into the configurations used for the different frameworks. A discussion of the results follows in Section 5, followed by general conclusions. Finally, we outline some of the limitations and opportunities for further research. Additional information on how this benchmark was conducted can be found in the Supplemental File, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2020.2978480.


2 RELATED WORK

Several benchmarking studies have been conducted over the last years. In this section, we outline the most important differences. A tabular overview has been given in Section 1 of the Supplemental File, available online. Most of these benchmarks were implemented on one or a combination of the following frameworks: Spark Streaming, Storm and Flink. In [4] and [5], Spark Streaming outperforms Storm in peak throughput measurements and resource utilization, but Storm attains lower latencies. The Yahoo benchmark [3] confirmed that Storm struggles under high throughput. Karakaya et al. [2] extended the Yahoo Benchmark and found that Flink 0.10.1 outperforms Spark Streaming 1.5.0 and Storm 0.10.0 for data-intensive applications. However, in applications where high latency is acceptable, Spark outperforms Flink. Karimov et al. [1] confirmed that the general performance of Spark Streaming 2.0.1 and Flink 1.1.3 is better than that of Storm. In this work, we include the latest versions of Spark Streaming 2.4.1 and Flink 1.9.1 due to their superior results. Additionally, we include Spark's new Structured Streaming 2.4.1 API and Kafka Streams 2.1.0, which have not been thoroughly benchmarked before. Micro-batch systems as well as event-driven frameworks are represented in this benchmark. Both Spark frameworks are micro-batch systems [7], meaning that they buffer events on the receiver side and process them in small batches. Kafka Streams and Flink work event-driven and only apply buffering when events need to be sent over the network.

Previous work mainly focused on benchmarking end-to-end applications in areas such as ETL, descriptive statistics and IO, e.g., [2], [3], [8], [9]. Some work has been conducted on benchmarking single operations such as parsing [8], filtering [4], [8], [9] and windowed aggregations [1]. The metrics for these single operations were, however, recorded as if they were end-to-end applications, by including ingesting, buffering, networking, serialization, etc. In this work, we also measure the impact of adding a certain operation to a pipeline. We ensured that each of these operations constitutes a separate stage in the DAG of the job. We based the design of our pipeline on the Yahoo benchmark [3], which has also been adopted by [2]. Besides a tumbling window, we also include a sliding window for the latency measurement workload. For these types of stateful transformations, some frameworks offer more flexibility via low-level APIs, custom triggers and operator access to managed state. In this work we implemented the tumbling and sliding window with built-in as well as low-level customized operators for Flink, Kafka Streams and Structured Streaming to investigate the advantages of using this flexibility.

Two of the most important performance indicators in stream processing systems are latency and throughput, which have been included in most previous benchmarks, e.g., [1], [3], [8], [9]. We use the same methodology to measure latency of single operations as in the initial proposal for this study [10]. The ability of a framework to process events faster and in bigger quantities with a similar setup leads to cost reductions and a greater ability to overcome bursts of data. Most research, however, merely analyses the maximum throughput that can be sustained for a prolonged period of time, e.g., [1]. In this benchmark, we analyze throughput in two additional scenarios. First, we investigate the behavior of the framework under a bursty workload. Second, we monitor the behavior of the framework when it needs to process a big burst of data right after startup. Some papers already studied the behavior of stream processing frameworks under bursty data. Shukla and Simmhan [8], and later RIoTBench [9], studied the behavior of Storm under bi-modal data rates with peaks in mornings and evenings, while Karimov et al. [1] studied the effects of a fluctuating workload for Spark, Flink and Storm. Spark and Flink showed similar behavior in the case of windowed aggregations, while Flink handled windowed joins better than Spark. Storm was the most susceptible to bottlenecks due to data bursts. Most other literature studied constant rate workloads under different data scales, e.g., [2], [3], [4], [5]. To be able to generate these different loads, we use temporal and spatial scaling similar to [8] and [9].

StreamBench [4] and Qian et al. [5] studied the effects of a failing node on throughput and latency. Their results indicated no significant impact for Spark Streaming, while Storm experienced a large performance decrease. In this work, in contrast, we investigate the behavior of processing a big burst of data at startup, mimicking the behavior triggered when the job would be down for a period of time.

This benchmark uses Apache Kafka [11] as a distributed publish-subscribe messaging system to decouple input generators from processors. In the Kafka cluster, streams of data are made available on partitioned topics. The majority of past benchmarking literature uses Apache Kafka for this purpose as this is representative of real-world usage, e.g., [2], [3], [8], [9], [10]. Furthermore, frameworks, e.g., [6], [12], require a replayable, durable data source such as Kafka to give exactly-once processing guarantees.

Finally, the configuration parameters of a framework have large effects on latency and throughput. Previous research often did not sufficiently document the configuration settings or used default values for all workloads. In this work, framework parameters are tuned separately for each workload to get a more accurate view of the performance of a framework on a certain metric or in a certain scenario. A thorough discussion follows in Section 4, supported by extra material in the Supplemental File, available online.

3 BENCHMARK DESIGN

In this section, we describe the benchmark setup. First, we introduce the input data source. Next, we elaborate on the operations that were included and the metrics that are monitored. Finally, we describe the different workloads and the architecture on which the processing jobs are run.

3.1 Data

We perform the benchmark on data from the IoT domain because of its increasing popularity and its growing number of use cases. We use traffic sensor data from the Netherlands provided by NDW (Nationale Databank Wegverkeersgegevens). This data contains measurements of the number of cars (flow) and their average speed at around
60,000 measurement points in the Netherlands and this for every minute of every lane of the road. The data publisher publishes a subset of this data set as a constant rate data stream on the Kafka input topics. For periodic burst workloads, every ten seconds an enlarged batch is published. The volume of data can be inflated by a configurable factor by using spatial scaling, similar to [9]. By changing the characteristics of the data, we mimic different datasets, thereby generalizing the benchmark to other use cases.

Fig. 1. Processing flow based on [10].

3.2 Processing Pipeline

A stream processing benchmark should be able to test the impact of different types of operations. Therefore, we run each workload for different pipeline complexities. We do this by using an extensible processing pipeline (Fig. 1) with operations similar to [3] and [10].

1) Ingest: Reading data from the Kafka flow and speed topics; no transformations are done on the data.
2) Parse: Parsing the JSON flow and speed messages.
3) Stream-stream join: Inner join of flow and speed messages with the same timestamp, measurement point and lane.
4) Tumbling window: Adding up the number of cars that passed by in the last second and averaging their speed for all the lanes of the road.
5) Extension: Sliding window: Only executed for the latency measurement workload. Computes the relative change in flow and speed for each measurement location over the last two and three seconds. The window length is three seconds and the slide interval is one second.

Initially, we execute the pipeline up to and including the ingest stage. We use the performance of this stage as a baseline. In the second run, we add the parse stage to the pipeline. This mimics ETL jobs with simple data transformations. Afterwards we add a join operation, common in data enrichment scenarios. Finally, we add window stages and customized stateful operations for testing more complex analytics capabilities. Intermediate stages do not publish their outputs to Kafka.

All stateful operations (joining and windowing) are done on event time. For this, we use the Kafka log append timestamp of the input observation. This is possible since the input stream is never out of order. Spark Streaming does not offer event time processing characteristics, therefore some extra logic is required to handle windows accurately in the case of bursty data.

Event-driven frameworks do not apply buffering on the receiver side and can, therefore, have significant latency advantages. This advantage disappears when built-in stateful operators such as tumbling windows are used. Since a flexible API is an important asset when optimizing pipeline performance, we included multiple implementations of the stateful operators. For the joining stage of the processing pipeline, we use the most appropriate join semantic available in the framework. For Flink and Kafka Streams, we use an inner interval join [13]. This type of join permits joining events with timestamps that lie in a relative time interval to each other, which is precisely what we want to do. This type of join has a much lower latency than the tumbling window implementation that is typical for micro-batch systems, since it can output events directly after receiving the entire pair for the join and does not have to wait for the tumbling of the window.
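To make this join semantic concrete, the sketch below shows what such an inner interval join looks like in Flink's Scala DataStream API. The case classes, field names and the ±50 ms join bound are illustrative assumptions for this sketch, not the exact benchmark implementation, which operates on the parsed NDW messages.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

// Simplified stand-ins for the parsed flow and speed messages.
case class FlowEvent(measurementId: String, lane: String, timestamp: Long, flow: Int)
case class SpeedEvent(measurementId: String, lane: String, timestamp: Long, speed: Double)

// Event-time timestamps and watermarks are assumed to be assigned upstream.
def joinFlowAndSpeed(flow: DataStream[FlowEvent],
                     speed: DataStream[SpeedEvent]): DataStream[(FlowEvent, SpeedEvent)] =
  flow.keyBy(e => (e.measurementId, e.lane))
    .intervalJoin(speed.keyBy(e => (e.measurementId, e.lane)))
    .between(Time.milliseconds(-50), Time.milliseconds(50)) // pair events whose timestamps lie close together
    .process(new ProcessJoinFunction[FlowEvent, SpeedEvent, (FlowEvent, SpeedEvent)] {
      override def processElement(f: FlowEvent, s: SpeedEvent,
                                  ctx: ProcessJoinFunction[FlowEvent, SpeedEvent, (FlowEvent, SpeedEvent)]#Context,
                                  out: Collector[(FlowEvent, SpeedEvent)]): Unit =
        out.collect((f, s)) // emitted as soon as both elements of a pair have arrived
    })
```

Because the pair is emitted as soon as its second element arrives, no window has to close before output is produced.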
The Flink API offers flexibility for stateful operations such as the ability to use different state backends [14], to define custom window triggers [12] or to use low-level as well as high-level APIs to do stateful operations [15]. We implement the tumbling window stage with the default event time trigger as well as with a custom trigger that triggers computation when enough data has arrived to do the computation. Furthermore, the sliding window stage was implemented with a built-in sliding window, as well as with the low-level processor API which gives direct access to managed keyed state and timers [15]. Instead of using a window buffer to compute the change in speed and flow over the last seconds, we manage our own list state to do the computation and generate output as soon as an observation enters the processor function. This has the effect that each event can be processed instantly.
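The sketch below illustrates this kind of low-level implementation: a KeyedProcessFunction that keeps recent per-key aggregates in managed list state and emits a relative change the moment a new observation arrives. The type names, fields and the exact change computation are simplified assumptions, not the authors' code.

```scala
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

// Simplified per-key aggregate and output types.
case class Aggregate(timestamp: Long, flow: Long, speed: Double)
case class RelativeChange(key: String, timestamp: Long, flowChange: Double, speedChange: Double)

class RelativeChangeFunction(lookbackMs: Long)
    extends KeyedProcessFunction[String, Aggregate, RelativeChange] {

  // Managed keyed list state holding the recent aggregates for the current key.
  private lazy val history: ListState[Aggregate] =
    getRuntimeContext.getListState(
      new ListStateDescriptor[Aggregate]("recent-aggregates", classOf[Aggregate]))

  override def processElement(current: Aggregate,
                              ctx: KeyedProcessFunction[String, Aggregate, RelativeChange]#Context,
                              out: Collector[RelativeChange]): Unit = {
    // Keep only the observations that still fall inside the lookback interval.
    val recent = history.get().asScala.toList.filter(_.timestamp >= current.timestamp - lookbackMs)

    // Emit immediately, relative to the oldest retained observation, instead of waiting for a window to fire.
    recent.sortBy(_.timestamp).headOption.foreach { oldest =>
      out.collect(RelativeChange(ctx.getCurrentKey, current.timestamp,
        (current.flow - oldest.flow).toDouble / math.max(oldest.flow, 1L),
        (current.speed - oldest.speed) / math.max(oldest.speed, 1e-9)))
    }

    history.update((recent :+ current).asJava)
    // ctx.timerService().registerEventTimeTimer(...) could additionally be used to expire idle keys.
  }
}
```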
Kafka Streams offers two levels of APIs: a high-level Domain Specific Language (DSL) and a low-level Streams Processor API. In this benchmark, we implemented the pipeline with the Kafka Streams DSL for the stateless stages. For the stateful stages we studied two implementations: one using the DSL windowing functionality and the other using the low-level processor API. The processor API allows us to interact with state stores and build customized processing logic. To implement the aggregation and relative change phases, we call the processor API from the DSL by supplying a transformer as described in the documentation [16]. For both stages, we use a persistent key value store backed by RocksDB to store state. RocksDB is the default state backend in Kafka Streams. By using customized low-level implementations we can make sure events are sent out as soon as possible, reaching lower latencies.
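This pattern is sketched below: a transformer supplied to the DSL that reads and updates a persistent (RocksDB-backed) key-value store and forwards its output record immediately. The topic names, store name and value type are illustrative; the benchmark stages operate on richer message types.

```scala
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KeyValue, StreamsBuilder}
import org.apache.kafka.streams.kstream.{Consumed, KStream, Produced, Transformer, TransformerSupplier}
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.state.{KeyValueStore, Stores}

val storeName = "relative-change-store"
val builder = new StreamsBuilder()

// Register a persistent (RocksDB-backed) key-value store with the topology.
builder.addStateStore(
  Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(storeName), Serdes.String(), Serdes.Double()))

val aggregates: KStream[String, java.lang.Double] =
  builder.stream("aggregates", Consumed.`with`(Serdes.String(), Serdes.Double()))

// Call the low-level Processor API from the DSL by supplying a transformer.
val changes: KStream[String, java.lang.Double] = aggregates.transform(
  new TransformerSupplier[String, java.lang.Double, KeyValue[String, java.lang.Double]] {
    override def get() = new Transformer[String, java.lang.Double, KeyValue[String, java.lang.Double]] {
      private var store: KeyValueStore[String, java.lang.Double] = _

      override def init(context: ProcessorContext): Unit =
        store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, java.lang.Double]]

      // Emit the change relative to the previously seen value for this key as soon as a record arrives.
      override def transform(key: String, value: java.lang.Double): KeyValue[String, java.lang.Double] = {
        val previous = store.get(key)
        store.put(key, value)
        if (previous == null) null else KeyValue.pair(key, java.lang.Double.valueOf(value - previous))
      }

      override def close(): Unit = ()
    }
  },
  storeName)

changes.to("relative-change", Produced.`with`(Serdes.String(), Serdes.Double()))
```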
Finally, we also include an alternative implementation for Structured Streaming. When we use built-in aggregations such as the tumbling window, we experienced issues with the propagation of watermarks leading to very high latencies and making it impossible to chain aggregations, as discussed in Section 5.1. Therefore, we also implemented these stages with mapping functions with custom state management. In this approach, we have fine-grained control over state and the publishing of output, so we do not rely on the propagation of watermarks.
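A minimal sketch of this alternative is shown below, using flatMapGroupsWithState so that the function itself decides when a one-second window is complete and emitted. The record types and the simplified windowing logic are illustrative assumptions, not the exact implementation used in the benchmark.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Simplified record types standing in for the joined observations and the windowed output.
case class Joined(measurementId: String, timestamp: Long, flow: Int, speed: Double)
case class WindowOutput(measurementId: String, windowEnd: Long, totalFlow: Long, avgSpeed: Double)

val spark = SparkSession.builder.appName("custom-state-sketch").getOrCreate()
import spark.implicits._

// Keeps a small per-key buffer in GroupState and decides itself when a one-second window
// is complete, so output does not depend on watermark propagation.
def aggregatePerSecond(key: String, events: Iterator[Joined],
                       state: GroupState[Seq[Joined]]): Iterator[WindowOutput] = {
  val buffer = state.getOption.getOrElse(Seq.empty) ++ events
  val cutoff = buffer.map(_.timestamp).max / 1000 * 1000      // start of the newest observed second
  val (ready, open) = buffer.partition(_.timestamp < cutoff)
  state.update(open)                                          // keep only the still-open second

  ready.groupBy(_.timestamp / 1000).map { case (second, group) =>
    WindowOutput(key, (second + 1) * 1000,
      group.map(_.flow.toLong).sum, group.map(_.speed).sum / group.size)
  }.iterator
}

// Applied to the Dataset produced by the join stage of the pipeline.
def addCustomWindow(joined: Dataset[Joined]): Dataset[WindowOutput] =
  joined.groupByKey(_.measurementId)
    .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(aggregatePerSecond _)
```

Because state handling and output emission are explicit, no watermark has to propagate through chained aggregations before results are published.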
The operations discussed in this section cover the main building blocks of stream processing pipelines for most use cases, as can be inferred from the documentation of these frameworks [14], [16], [17], [18]: basic operations, joining, windowing and processing functions. We believe that by including these operations, we can provide a general performance assessment of these frameworks.


3.3 Metrics

We collect four metrics for every run: the latency of each message, the throughput per second and the memory and CPU utilization of the workers. For each run, except for the single burst workload, we ignore the first five minutes of data since these initial minutes often show increased latency and CPU due to the start up.

3.3.1 Latency

Latency is the amount of time the processing framework requires to generate an output message from one or multiple input messages. This can be difficult to measure since distributed systems do not have a global notion of time. The approach we follow to counter this is described in Section 3.4.1. We unify the way latency is measured across frameworks by subtracting the Kafka log append timestamps of the output and input message. We do not rely on the internal metrics of the frameworks since this does not guarantee us a uniform definition of latency. Furthermore, our definition of latency includes the time the event resided on Kafka before the framework started processing the observation. When backpressure kicks in, this difference between event time latency and processing-time latency can become significant [1]. In the case where multiple input messages are required for one output message, we use the timestamp of the latest input message that was required to do the computation, as proposed by [1].

3.3.2 Throughput

We define throughput as the number of messages a framework processes per second. We measure throughput in three scenarios: peak sustainable throughput under constant load, burst throughput at startup and throughput under periodic bursts. Peak sustainable throughput expresses the maximum throughput that a framework can sustain over an extended period of time without becoming unstable. We consider a system to be stable when the latency and queue size on the input buffer do not continuously increase [1] and when average CPU levels are within reasonable bounds, i.e., lower than 80 percent. Since this benchmark focuses on (near) real-time processing, we do not allow median latencies to go above 10 seconds. In addition, we measure the burst throughput at startup to capture the speed at which a framework can make up for an incurred lag. Finally, we compare input and output throughput for workloads with periodic bursts.

3.3.3 Memory Utilization

We collect the heap memory usage of the JVMs that run the processing jobs by scraping the JMX metrics exposed by the jobs. The effect on CPU usage of enabling JMX is negligible. Heap memory is used, by all frameworks, to store state and do computations. However, memory management is intrinsically different for each framework, as will be evident from the results.

3.3.4 CPU Utilization

Finally, we look at CPU utilization or the ability to evenly spread work across the different slaves as well as the processing power required to do specific operations. We consider average CPU levels above 80 percent as unhealthy due to limited resources that are available to the JVM to maintain the running processes when unforeseen bursts of data occur. We monitor CPU at the container level using cAdvisor [19].

3.4 Workloads

Stream processing frameworks need to be able to handle different data characteristics and processing scenarios: bursty workloads, constant rate workloads with different throughput levels, catching up after downtime. We define specific workloads for each of these scenarios. Furthermore, we define separate optimized workloads for measuring latency and throughput since they influence each other, e.g., via buffer sizes and timeouts [12].

3.4.1 Workload for Latency Measurement

The first workload is the one designed for measuring latency in optimal circumstances. Latency is measured by subtracting the Kafka log append time of the output and input message. Kafka is a distributed service and these timestamps are appended by the leading broker of the topic partition when the message arrives. The leading brokers for the input and output partitions can reside on different machines. Time-of-day clocks can demonstrate noticeable time differences across machines. This makes computations with timestamps from different brokers inaccurate. In an experiment with five brokers, we clocked negative latencies of up to 55 ms for up to ten percent of the events for the earlier stages of the pipeline. Network Time Protocol (NTP) can be used to synchronize time-of-day clocks but still has a minimum error of 35 ms [20]. Due to this, time-of-day clocks are not suitable to measure latencies of less than 10 ms. To counter this, we use a single Kafka broker and we compare the timestamps that were attached to the input and output message by that single Kafka broker [10]. This ensures that all time recordings are done on the same machine and therefore, based on the same clock. Log append time, i.e., the wall clock time of the broker, is accurate up to a millisecond level, which is sufficient for our measurements [21].
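As an illustration, a minimal consumer that computes this latency could look as follows, assuming the broker is configured with log.message.timestamp.type=LogAppendTime and that an output message can be matched to the last input message it depends on by key. Topic names and the matching scheme are illustrative; this is not the benchmark's metrics exporter.

```scala
import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "latency-measurement")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("benchmark-input", "benchmark-output"))

// Maps an input key to the broker's log append time of that input record.
// (A real implementation would evict old entries to bound memory.)
val inputAppendTime = scala.collection.mutable.Map.empty[String, Long]

while (true) {
  for (record <- consumer.poll(Duration.ofMillis(100)).asScala) {
    record.topic() match {
      case "benchmark-input" =>
        // record.timestamp() is the broker's log append time, i.e., one clock for all records.
        inputAppendTime(record.key()) = record.timestamp()
      case "benchmark-output" =>
        inputAppendTime.get(record.key()).foreach { in =>
          println(s"latency of ${record.key()}: ${record.timestamp() - in} ms")
        }
    }
  }
}
```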
This workload is used to determine the lowest obtainable latency. To eliminate the influence throughput has on latency, we use a very low throughput to prevent stressing the framework and the single Kafka broker. In the next workload we gradually increase the load to see how latency holds up as throughput grows. Data flows in at a constant rate of 400 messages per second throughout the entire run. This equals a volume of around 135 MB in one run of thirty minutes. We run this workload for different pipeline complexities to be able to see its impact on latency, and we add an additional sliding window stage to the pipeline.

3.4.2 Workload for Sustainable Throughput Measurement

Whereas the previous workload focuses on determining the lowest obtainable latency, this workload investigates how this latency evolves when the framework is used for more realistic and considerable loads, and serves to determine the peak sustainable throughput. Additionally, we use this workload to investigate the behavior of the frameworks under a constant data input rate, which is common for use cases such as
monitoring systems (e.g., application logs), preventive maintenance workloads or connected devices (e.g., solar panels).

We determine the peak sustainable throughput by running the workload repeatedly for increasing data scales and monitoring latency, throughput and CPU. For latency, we check whether the median latency does not increase monotonically throughout the run and remains below 10 seconds. When the volume of data is higher than the framework can process, latency increases constantly throughout the workload due to processing delays. Furthermore, we also check whether the framework has processed all events by the time the stream ends. In this work, we assume that once the input stream halts, the framework needs to be able to finish processing in less than ten seconds, otherwise it was either lagging behind on the input data stream or batching at intervals higher than 10 seconds. Finally, we monitor whether the average CPU utilization of the framework containers does not exceed 80 percent.
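Restated as a simple check over one run's aggregated metrics, the sustainability criteria look roughly as follows. The thresholds follow the text; how the per-minute latency medians, container CPU and drain time are collected is assumed to happen in the monitoring stack.

```scala
// Aggregated metrics of one benchmark run (collection of these values is assumed elsewhere).
final case class RunMetrics(medianLatencyPerMinuteMs: Vector[Double], // median latency per minute of the run
                            avgCpuPercent: Double,                    // average container CPU over the run
                            drainTimeSeconds: Double)                 // time to finish once the input stream halts

def isSustainable(run: RunMetrics): Boolean = {
  val m = run.medianLatencyPerMinuteMs
  val belowLatencyCeiling = m.forall(_ <= 10000.0)                    // median latency stays below 10 s
  val monotonicallyIncreasing =                                       // latency keeps growing: the job is lagging
    m.size >= 2 && m.zip(m.tail).forall { case (a, b) => b > a }
  val cpuHealthy = run.avgCpuPercent <= 80.0
  val drainedInTime = run.drainTimeSeconds <= 10.0
  belowLatencyCeiling && !monotonicallyIncreasing && cpuHealthy && drainedInTime
}
```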
3.4.3 Workload With Burst at Startup

The third workload measures the peak burst throughput that a framework can handle right after startup. This mimics the effect of a job trying to catch up with an incurred delay. For this, we start up the data publisher before the job has been started and let it publish around 6,000 events per second, which is a throughput level that all frameworks are able to handle but still imposes a considerable load. Five minutes later, we bring up the processing job and start processing the data on the Kafka topics from the earliest offset. This equals around 1 800 000 events or 350 MB, with linearly increasing event times. Once the processing job is running, we reduce the throughput to around 400 messages per second to let event time progress but not put any more load on the framework. We let this processing job run for ten minutes. Afterwards, we evaluate the throughput at which the framework was able to process the initial burst of data.

3.4.4 Workload With Periodic Bursts

Finally, we want to monitor the ability of the framework to overcome bursts in the input stream. This mimics use cases such as processing data from a group of coordinated sensors or from connected devices that upload bursts of data every few hours, or use cases which need to be able to sustain peaks of data such as web log processing. We have a constant stream of a very low volume and every ten seconds there is a large burst of data. Each burst of data contains approximately 38 000 events, which is approximately 7.5 MB. The publishing of one burst takes around 170 to 180 ms. We look at how long latency persists at an inflated level after a large burst and at the effects on CPU and memory utilization.

3.5 Architecture

To make our benchmark architecture mimic a real-world production analytics stack, we use AWS EC2. We set up a CloudFormation stack which runs DC/OS on nine m5.2xlarge instances (one master, nine slaves). Each of these instances has 8 vCPU, 32 GB RAM, a network bandwidth of up to 10 Gbps and a dedicated EBS bandwidth of up to 3500 Mbps. Each instance has an EBS volume attached to it with 150 GB of disk space. We use DC/OS as an abstraction layer between the benchmark components and the EC2 instances on which they run. All benchmark components run in Docker containers on DC/OS. To mitigate the potential performance differences between EC2 instances, we run each benchmark run five times on different clusters and select the run with median performance.

As a message broker, we use Apache Kafka. For most runs, the cluster consists of five brokers (2 vCPU, 6 GB RAM), which is the same number as the number of workers for each of the frameworks. As described in Section 3.4.1, we use a single Kafka broker (6.5 vCPU, 24 GB RAM) for the latency measurement workload. We use Kafka Manager for managing the cluster and creating the input and output topics, each of these with 20 partitions since each framework cluster has 20 processing cores. Messages are not replicated to avoid creating too much load on the Kafka brokers. The input topics for the speed and flow messages are partitioned on the ID since this is the key for joining and aggregating. Additionally, all brokers are configured to use LogAppendTime for latency measurements. The roundtrip latency of the Kafka cluster is around 1 ms, tested by publishing and directly consuming the message again. We use cAdvisor, backed by InfluxDB, to monitor and store CPU and networking metrics of the framework containers. Concurrently, the processing jobs expose heap usage metrics on an endpoint via JMX, which is then scraped and stored by a JMX collector. An architecture diagram and detailed information on the distribution of resources over all services has been listed in Section 3 of the Supplemental File, available online. Without the framework clusters, the DC/OS cluster has 45 percent CPU allocation and 33 percent memory allocation. The framework cluster adds another 28 percent CPU allocation and 35 percent memory allocation.

4 FRAMEWORKS

In this section, we discuss each of the frameworks and their configurations, as listed in Table 1. We only discuss the configuration parameters for which we do not use the default values. Parameter tuning is done per workload to get a more accurate measurement of performance. Configurations were chosen based on the documentation of the frameworks, expert advice, and an empirical study, which can be further consulted in the Supplemental File in Section 4, available online.

Some settings are equal for all frameworks. For the frameworks that use a master-slave architecture, i.e., Flink and both Spark frameworks, we deploy a standalone cluster with one master and five workers in Docker containers. Kafka Streams runs with five instances. Each framework gets the same amount of resources for their master (2 vCPU, 8 GB RAM) and slaves (4 vCPU, 20 GB RAM each). Parallelism is set to 20 since we have 20 Kafka partitions and 20 cores per framework.

For event-driven frameworks we use event time as time characteristic. These frameworks use watermarks to handle out-of-order data [12]. We choose an out-of-order bound of 50 ms since we do not have out-of-order data that needs to be handled but we do need to take into account the possible time desynchronization between the brokers. Assume that, in the case of a constant rate of data, two events that need to
be joined or aggregated arrive at the extreme beginning and end of the interval. If not all brokers have synchronized time and the out-of-orderness bound is set to zero, it is possible that the watermark had already progressed past the log append time of the second event by the time it entered the system. This might happen if one broker is lagging 50 ms behind on another broker in system time. In this case the join will not take place for this pair of events since it will be seen as 50 ms out of order. We assume that the time difference between the brokers is not more than 50 ms and therefore, we choose an out-of-orderness bound of 50 ms. We do not use a watermark larger than 50 ms because this inherently increases the latency [12], [13].

Event-driven frameworks buffer events before they are sent over the network to reduce the load. Buffers are flushed when they are full or when a configurable timeout has passed. This timeout usually has a default value of 100 ms, which is what we use for throughput measurements. For the latency measurement workload, we disable buffering by setting the buffer timeout of Flink and the linger time of Kafka Streams to 0 ms, thereby optimizing for latency. Structured Streaming and Spark Streaming do micro-batching and, therefore, intrinsically buffer events. A more thorough explanation of this is given in the Supplemental File in Sections 4.1.1 and 4.2.1, available online.

TABLE 1
Framework Configuration Parameters

a. Apache Flink (v1.9.1)
Parameter                    | Value                   | Default
JobManager count             | 1                       | /
TaskManager count            | 5                       | /
JobManager CPU               | 2                       | /
JobManager heap / memory     | 8 GB / 8 GB             | /
TaskManager CPU              | 4                       | /
TaskManager heap / memory    | 18 GB / 20 GB           | /
Number of task slots         | 4                       | 1
Default parallelism          | 20                      | 1
Time characteristic          | event time              | processing time
State backend                | FileSystem              | InMemory
Buffer timeout               | 100 ms (L: 0 ms)        | 100 ms
Checkpoint interval          | 10 000 ms               | None
Watermark interval           | 50 ms                   | /
Out-of-orderness bound       | 50 ms                   | /
Object reuse                 | enabled                 | disabled

b. Apache Kafka Streams (v2.1.0)
Parameter                    | Value                   | Default
Instances count              | 5                       | /
Number of threads / CPUs     | 4                       | /
Java heap / memory           | 18 GB / 20 GB           | /
Kafka topic parallelism      | 20                      | 1
Commit interval              | 1000 ms                 | 30 000 ms
Message timestamp type       | LogAppendTime           | CreateTime
Linger ms                    | 100 (L: 0)              | 100
Grace period:
  join & tumbling window     | 5 s (PB, ST: 30 s)      | /
  sliding window             | 50 ms                   | /
Retention time               | interval + grace        | 1 day
Max task idle ms             | 0 (SB: 300 000)         | 0
Producer compression         | lz4 (L: none)           | none
Producer batch size          | 200 KB (L: 16 KB)       | 16 KB

c. Spark Streaming and Structured Streaming (v2.4.1)
Parameter                    | Value                   | Default
Master count                 | 1                       | /
Worker count                 | 5                       | /
Master CPUs / memory         | 2 / 8 GB                | /
Worker CPUs / memory         | 4 / 20 GB               | /
Driver cores / heap          | 2 / 6 GB                | /
Executor cores / heap        | 4 / 17 GB               | /
Number of executors          | 5                       | /
Default parallelism          | 20                      | parallelism of parent or cores
SQL shuffle partitions       | 5 (str.), 20 (sp.)      | 200
Serializer                   | kryo                    | java
Locality wait                | 100 ms                  | 3 s
Garbage collector            | G1GC                    | Parallel GC
Initiating heap occupancy    | 35%                     | 45%
Parallel GC threads          | 4                       | /
Concurrent GC threads        | 2                       | /
Max GC pause ms              | 200                     | 200
Micro-batch interval:
  Initial stages             | 200 ms                  | /
  Analytics stages           | 1 s (ST: 3 s, 5 s)      | /
Trigger interval (str.)      | 0                       | /
Block interval ms            | 50 (ST: 150, 250 sp.)   | 200
Watermark ms (str.)          | 50                      | /
minBatchesToRetain (str.)    | 2                       | 100

((str.) refers to Structured Streaming; (sp.) refers to Spark Streaming.) Specific workload parameters are noted by L (latency workload), ST (sustainable throughput workload), SB (single burst workload) and PB (periodic burst workload).

4.1 Apache Flink

For Apache Flink [12], the configuration settings are listed in Table 1(a). The five task managers each have 4 CPUs and, therefore, four task slots as suggested by [14]. Additionally, they get 20 GB memory of which 18 GB is assigned to the heap and the other 2 GB is left for off-heap allocation to network buffers and the managed memory of the task manager [14].

Three state backends are currently available in Flink: MemoryStateBackend, FileSystemBackend and RocksDBBackend. We use the FileSystemBackend since this is the recommended backend for large state that fits in heap memory. MemoryStateBackend is used for jobs with little state and in development and debugging stages. RocksDB is a state store kept off-heap and is recommended if the state does not fit in the heap memory of the task managers [14]. The FileSystemBackend we use is backed by HDFS and checkpointing is done every ten seconds. Due to Flink's asynchronous checkpointing mechanism, the processing pipeline is not blocked while checkpointing. Flink's mechanism of keeping state reduces the load on the garbage collector, as opposed to the mechanism used by Spark, which still heavily relies on JVM memory management, as described in Section 4.4.

We define a watermark interval of 50 ms. This means the current watermark is recomputed every 50 ms. Finally, we enable object reuse since this can significantly increase performance.
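A sketch of how these settings from Table 1(a) translate into job code is given below; the checkpoint path and the surrounding job setup are illustrative.

```scala
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setParallelism(20)
env.getConfig.setAutoWatermarkInterval(50)                           // recompute the watermark every 50 ms
env.getConfig.enableObjectReuse()                                    // avoid defensive copies between operators
env.enableCheckpointing(10000)                                       // checkpoint every 10 s
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints")) // FileSystem backend (path is illustrative)
env.setBufferTimeout(100)                                            // 0 ms in the latency workload

// 50 ms out-of-orderness bound to absorb clock differences between brokers.
case class Observation(measurementId: String, kafkaTimestamp: Long)
class ObservationWatermarks
    extends BoundedOutOfOrdernessTimestampExtractor[Observation](Time.milliseconds(50)) {
  override def extractTimestamp(o: Observation): Long = o.kafkaTimestamp
}
```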
4.2 Apache Kafka Streams

In 2016, Apache Kafka released Kafka Streams [13], a client library for processing data residing on a Kafka cluster. Kafka Streams does not make use of a master-slave architecture and does not require a cluster, but runs with different threads that rely on the Kafka cluster for data parallelism, coordination and fault tolerance. All Kafka Streams instances share the same consumer group and application id such that they all process a part of the input topic partitions.


Each instance will be running on four threads to optimally use all resources. Kafka Streams stores state on Kafka topics and, therefore, does not require a HDFS cluster.

When using the DSL, we need to set the grace period and retention times for the window stages. Since each input record leads to an output update record [13] and only the complete output records are kept, the grace period does not introduce additional latency and has been put at a high number of 5 s. The retention time is a lower-level property that sets the time events should be retained, e.g., for interactive queries, and has no influence on the latency. We set this to be equal to the slide interval plus grace period.

We do not use message compression for latency measurements. For measuring throughput and resource consumption, we use slightly different parameters, as recommended by [22]. We keep the linger time at the default 100 ms. We use lz4 message compression to reduce the message sizes and load on the network, thereby increasing throughput. We also increase the size of output batches to reduce the network load.

For processing a single burst, we set the maximum task idle time to 300 000 ms. By doing this, Kafka Streams waits an increased amount of time to buffer incoming events on all topics before it starts processing. This prevents the system from running ahead on some topics and subsequently discarding data on other topics if their event time is too far behind.
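In terms of configuration, the settings from Table 1(b) correspond roughly to the following properties; the application id and bootstrap servers are illustrative.

```scala
import java.util.Properties

val props = new Properties()
props.put("application.id", "stream-benchmark")   // shared by all five instances
props.put("bootstrap.servers", "broker:9092")
props.put("num.stream.threads", "4")              // four processing threads per instance
props.put("commit.interval.ms", "1000")           // default is 30 000 ms
props.put("max.task.idle.ms", "0")                // 300 000 for the single-burst workload
// Producer overrides: output batching and compression (disabled for the latency workload).
props.put("producer.linger.ms", "100")            // 0 in the latency workload
props.put("producer.compression.type", "lz4")     // none in the latency workload
props.put("producer.batch.size", "204800")        // roughly 200 KB output batches (default 16 KB)
```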
4.3 Apache Spark: Spark Streaming

Apache Spark [7] is a cluster computing platform that consists of an extended batch API in combination with a library for micro-batch stream processing called Spark Streaming. The job runs with one driver and five executors. On each worker, we allocate 1 GB of the 20 GB to JVM overhead. The executors use 10 percent of the remaining memory for off-heap storage, which leaves 17 GB for heap memory. For the driver we allocate 6 GB of heap memory. Checkpoints are stored in HDFS.

When executing the initial stages of the pipeline, i.e., ingesting and parsing, the jobs are configured to have a micro-batch interval of 200 ms. In the initial stages, the selectivity of the data is one on one, so a lower batch interval leads to lower latencies. For the analytics stages we set the micro-batch interval to the same length as the interval at which sensors send their data and on which we will do our aggregations, which is one second. For the sustainable throughput workload, we set the batch interval to three and five seconds since this increases the peak throughput [7]. Spark Streaming splits incoming data in partitions based on the block interval. Therefore, it is recommended to choose the block interval to be equal to the batch interval divided by the desired parallelism [17].

For reading from Kafka we use the direct stream approach with PreferConsistent as location strategy, meaning the 20 Kafka partitions will be spread evenly across available executors [17]. Often it is recommended to use three times the number of cores as the number of partitions. This, however, only leads to performance improvement in the case of data skew. When keys are not equally distributed, having more partitions than the number of cores allows more flexibility in spreading work over all executors. In this use case, the data is equally spread over all keys. Hence, using the same number of partitions as the number of cores leads to less overhead and better performance.

In all workloads, Spark Streaming checkpoints at the default interval, which is a multiple of the batch interval that is at least 10 seconds [17]. We use the kryo serializer for serialization and register all classes since it is faster and more compact than the default java serializer.
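A sketch of how these choices come together in the job setup is given below; application, topic and group names are illustrative, and the GC flags from Table 1(c) are assumed to be passed as executor JVM options.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf()
  .setAppName("spark-streaming-benchmark")
  .set("spark.default.parallelism", "20")
  .set("spark.streaming.blockInterval", "50ms")    // 150/250 ms in the sustainable-throughput runs
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // classes are registered separately
  .set("spark.locality.wait", "100ms")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=4 " +
    "-XX:ConcGCThreads=2 -XX:MaxGCPauseMillis=200")

// 200 ms micro-batches for the stateless stages; 1 s (or 3 s / 5 s) for the analytics stages.
val ssc = new StreamingContext(conf, Milliseconds(200))

// Direct stream with PreferConsistent spreads the 20 partitions evenly over the executors.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](Seq("speed", "flow"), Map[String, Object](
    "bootstrap.servers" -> "broker:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "spark-streaming-benchmark")))
```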
4.4 Apache Spark: Structured Streaming

In 2016, Apache Spark released a new stream processing API called Structured Streaming [6], which enables users to program streaming applications in the DataFrame API. The DataFrame API is the dominant batch API, and Structured Streaming brings Apache Spark one step closer to unifying their batch and streaming APIs. Structured Streaming offers a micro-batch execution mode and a continuous processing mode for event-driven processing. We use the micro-batch approach because the continuous processing mode is still experimental and does not support stateful operations at the time of this writing. Additionally, the micro-batch API does not support chaining of built-in aggregations; therefore, we do not execute the sliding window stage. However, we also include an implementation of the final two stages using map functions with custom state management, since this circumvents the issues with built-in aggregations.

For Structured Streaming, we trigger job queries as fast as possible, which is enabled by setting the trigger interval to 0 ms. Queries run in the default append mode, which means only new rows of the streaming DataFrame will be written to the sink. Due to the architectural design of Structured Streaming, checkpointing is done for each batch [6]. This leads to a very high load on HDFS. To be able to run the jobs at full capacity, a significant size increase of the HDFS cluster was necessary, since the checkpointing for every micro-batch led to a latency increase of several seconds for the windowing phases.

The default parallelism parameter arranges parallelism on the RDD level. Structured Streaming, however, makes use of DataFrames. The parallelism of DataFrames after shuffling is set by the SQL shuffle partitions parameter. Since all workloads have small state and since opening connections for storing state is expensive, performance improves significantly when the number of SQL shuffle partitions is reduced to 5. When doing this for Spark Streaming, we did not notice a similar performance improvement. Structured Streaming checkpoints the range of offsets processed after each trigger. In contrast, the default checkpoint interval of Spark Streaming is never less than 10 seconds. Due to the heavier use of checkpointing, Structured Streaming benefits more from reducing the number of SQL shuffle partitions. We use kryo for fast and compact serialization. We use the G1 garbage collector (GC) with the parameters listed in Table 1 to reduce GC pauses. A more thorough explanation of the reasoning behind this can be found in the Supplemental File, available online. Finally, we set the minimum number of batches that has to be retained to two, which further reduces memory consumption and, therefore, GC pauses.
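The corresponding Structured Streaming setup is sketched below; topic names and the checkpoint location are illustrative, and the watermark is only relevant for the built-in windowed variants.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder
  .appName("structured-streaming-benchmark")
  .config("spark.sql.shuffle.partitions", "5")            // small state, so few shuffle partitions per micro-batch
  .config("spark.sql.streaming.minBatchesToRetain", "2")  // reduces checkpoint and memory footprint
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "speed,flow")
  .load()
  .withWatermark("timestamp", "50 milliseconds")          // 50 ms watermark on the log append time

val query = input.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "benchmark-output")
  .option("checkpointLocation", "hdfs:///checkpoints/structured-streaming")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(0))                     // trigger micro-batches as fast as possible
  .start()
```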


TABLE 2
Result Overview for Flink (FL), Kafka Streams (K), Spark Streaming (SP) and Structured Streaming (ST)

Workload               | Decision factor         | FL | K | SP | ST
Latency                | stateless stages        | 4  | 4 | 1  | 1
                       | built-in stateful API   | 4  | 3 | 4  | 1
                       | customized stateful API | 4  | 4 | -  | 2
Sustainable throughput | unconstrained latency   | 2  | 1 | 3  | 4
                       | constrained latency     | 4  | 3 | 2  | 1
Single burst           | initial output          | 4  | 3 | 1  | 2
                       | processing time         | 2  | 1 | 2  | 4
Periodic burst         | stateless stages        | 3  | 3 | 1  | 4
                       | built-in stateful API   | 4  | 1 | 3  | 1
                       | customized stateful API | 4  | 4 | -  | 2

Scores can only be interpreted relative to the scores of the other frameworks. Higher scores indicate better performance. (Spark Streaming has no customized stateful implementation, so no score is given for that decision factor.)

5 RESULTS
In this section, we discuss the results of each of the work-
loads. First, we discuss the latency and sustainable through-
put workloads and afterwards the burst workloads. For all
of the results we did multiple runs with similar results and
we discarded the first five minutes of the result timeseries to
filter out warm-up effects. An overview of the results has
been summarized in Table 2 per workload and for each addi-
tional decision factor. These decision factors influence the
decision for a framework based on job requirements or pipe-
line complexity. A higher score refers to better performance.
The scores do not have an absolute value and should be
interpreted relatively to the scores of the other frameworks.

5.1 Workload for Latency Measurement


This first workload was designed to measure latency accu-
rately, as was described in Section 3.4. The latency distribu-
tion for each of the pipeline complexities and frameworks
has been given in Fig. 2.
By using event-driven processing and no buffering, Flink
and Kafka Streams have the lowest latency for ingesting the
data, with a median of 0 ms and a p99 of 1–2 ms. When pars-
ing is added to the pipeline, the median latency of both
frameworks increases by 1 ms. Flink has a lower p99 latency
than Kafka Streams, 2 ms and 7 ms respectively. When we
experiment with higher buffer timeouts and linger times,
these latencies rise linearly, as documented in the Supplemental File, available online, confirming [12]. The latencies for Spark Streaming and Structured Streaming are considerably higher, as expected for a micro-batch approach.

Fig. 2. Latency distribution for all frameworks. The stage until which the pipeline was executed is shown on the y-axis. The chart for Structured Streaming has a different x-axis scale than the other charts.

For Spark Streaming we chose a micro-batch interval of 200 ms for these stateless stages, giving us median latencies of around 140 ms to 150 ms and p99 latencies of 316 ms to 350 ms. Structured Streaming uses the approach of processing as fast as possible. As a result, the median latency is slightly lower than for Spark Streaming, with 105 ms for ingesting and 148 ms when parsing is added. We notice, however, much more pronounced tail latencies, with events requiring up to 930 ms of processing time. This can be an important factor for jobs under tight SLAs.

When we include the joining stage in the processing pipeline, the differences in latencies between the frameworks grow. For Spark Streaming, the micro-batch interval was raised to one second because this is equal to the join interval. Since processing one batch takes longer, Structured Streaming also forms larger micro-batches. The median latency of Spark Streaming was 650 ms after joining. The micro-batch interval is 1,000 ms, so the buffering time for an event is on average 500 ms. Since the median latency is 650 ms, we infer that it takes an additional 150 ms to process all events of the batch. This is confirmed by the form of the distribution, which is rather uniform between 170 ms and 1,140 ms. Speed and flow observations of one measurement point are sent immediately after each other. The lower latencies come from pairs of events that arrived at the end of the tumbling window while the higher latencies come from pairs of events
that arrived at the beginning of the window. Processing a ms for the sliding window stage. This can be an issue for use
batch of events, therefore, took between 140–170 ms. Further- cases with tight latency requirements.
more, we see that the variance of the latency increased after For the older Spark Streaming API, we measure low
adding the join. This is due to the fact that the events are median latencies of 667 for tumbling window and 765 ms
joined using a tumbling window. The p99.999 latency is 200 after adding the sliding window. However, we see the 99th
ms higher than the p99 latency, meaning there are some percentile latencies increase with complexity. We also notice
minor outbreaks but that in general latency stays within pre- that the uniform distribution caused by the tumbling win-
dictable ranges. The latencies for Structured Streaming suf- dow join persists throughout the following stages. The tum-
fered from a much longer tail with the p99 above 1,400 ms, bling window as well as sliding window have slide intervals
while the median stays at 582 ms. of one second. Therefore, no latency reductions can be
For the event-driven processing frameworks, Flink and obtained by using customized implementations since the
Kafka Streams, the type of join used is an interval join which latency is directly tight to the fixed micro-batch interval.
can obtain a much lower latency. By joining events that lay Spark Streaming becomes a very good contender for these
within a specified time range from each other, the frame- more complex transformations to the built-in high-level
work can give an output immediately after both required APIs of natively faster frameworks Flink and Kafka Streams.
events arrive and does not wait for an interval to time out. For Flink and Kafka Streams we compare two implementa-
The latency of joining is the lowest for Flink with a median tions. When using the built-in API, the latency of Kafka
round-trip latency of 1 ms and 99 percent of the events proc- Streams increases the least with the addition of the tumbling
essed under 3 ms. Kafka Streams reaches median latencies window, although this behavior is not sustained when the
of 2 ms and shows much higher tail latencies with a p99 of sliding window is added. For all frameworks, we notice
30 ms and a p99.999 of 218 ms. We always compute the growing tail latencies when complexity is added via joins
latency of the last event that was required to do the join. If and aggregations. Partly this is due to the checkpointing
we would incorporate the latency of the first event, these mechanisms that are required to do stateful stream process-
latencies would be higher, but we would not be expressing ing. As state increases, the garbage collection becomes more
processing time. time consuming and less effective because a larger state
The last two stages of the processing pipeline are the windowing stages: a tumbling window and a sliding window are implemented and chained together. At the time of writing, Structured Streaming did not support chaining built-in aggregations. Therefore, we do not execute the built-in sliding window stage for Structured Streaming. We can circumvent this issue by using map functions with custom state management and processing time semantics, as discussed later. The performance of the built-in tumbling window behind the join led to a large performance degradation, with p99 latencies of 4,649 ms and a median latency of 2,807 ms. There are a few main contributors to this 2,800 ms median latency. The first one is micro-batching. Despite setting the trigger to 0 ms to achieve processing as fast as possible, we still see micro-batches of around 800-1,000 ms occurring, with occasional increases due to garbage collection. With a constant stream coming in, this means that each event already has an average delay of 400-500 ms before processing has even started. The processing time of the batch in which the event resides then adds another 1,000 ms to this time. Once a record has been joined, its propagation to the tumbling window phase is delayed until event time has passed the window end and watermark interval, which means records are buffered for another micro-batch interval. These three factors together add up to a base latency of 2,800 ms and an even higher p99 latency. As confirmed by the developers of Structured Streaming in [6], the micro-batch mode of Structured Streaming has been developed to sustain high throughput and should not be used for latency-sensitive use cases. By using customized stateful mapping functionality with processing time semantics, we can improve performance since we no longer rely on watermark propagation. As can be seen in Fig. 2g, the latency of the tumbling window stage now has a median of 1,277 ms. Adding the sliding window stage leads to a median latency of 1,704 ms. We still see large tail latencies for both stages, going up to 3,700 ms, partly because state needs to be kept and checked at each cycle. Finally, the shuffling and interdependence between the observations to generate the output increase the variance of the latency.

Through customizing stateful implementations, we can heavily reduce the latency for some use cases. For Flink, using a custom trigger for the tumbling window reduced the median latency to 1 ms with a p99 of 5 ms. This is much lower than with the default trigger, because the default trigger forces events to be buffered until the end of the window and adds an additional delay due to the burst of data that needs to be processed at the end of the interval. By allowing events to be sent out as fast as possible, the latency can be significantly reduced. Similarly, adding a processor function with managed keyed state and timers to do the sliding window keeps the median latency at 1 ms and slightly increases the p99 to 215 ms. This shows that the flexibility of the Flink API enables optimizing the way stateful operations are done, e.g., by reducing the time and amount of data that is buffered. For Kafka Streams, we also notice large latency reductions (4x) when using the low-level API, although not as large as those of Flink. Besides, it is important to keep in mind that not all processing flows can be optimized by a customized implementation. In this flow, optimization was possible because the built-in implementation unnecessarily blocks events until a window timeout is reached.
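To make the custom trigger approach discussed above more concrete, the sketch below shows a Flink trigger that fires the window function on every incoming element instead of waiting for the watermark to pass the window end. It is a minimal illustration rather than the code used in the benchmark; the Measurement case class, the key field, the one-second window and the reduce function are assumptions made for the example.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Hypothetical event type; the benchmark pipeline uses its own observation classes.
case class Measurement(key: String, timestamp: Long, value: Double)

// Fires on every element so partial results are emitted immediately,
// and only purges the window state once event time has passed the window end.
class EagerTrigger[T] extends Trigger[T, TimeWindow] {

  override def onElement(element: T, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    // Make sure the window is eventually cleaned up when event time passes its end.
    ctx.registerEventTimeTimer(window.maxTimestamp())
    TriggerResult.FIRE
  }

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    if (time == window.maxTimestamp()) TriggerResult.PURGE else TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit =
    ctx.deleteEventTimeTimer(window.maxTimestamp())
}

object EagerTumblingWindowSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Needed on Flink versions prior to 1.12, where processing time is the default.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val measurements: DataStream[Measurement] = env
      .fromElements(Measurement("a", 1000L, 1.0), Measurement("a", 1500L, 2.0))
      .assignAscendingTimestamps(_.timestamp)

    measurements
      .keyBy(_.key)
      .window(TumblingEventTimeWindows.of(Time.seconds(1)))
      .trigger(new EagerTrigger[Measurement])
      // Trivial aggregation for the sketch: running maximum per key and window.
      .reduce((a, b) => if (a.value >= b.value) a else b)
      .print()

    env.execute("eager-tumbling-window-sketch")
  }
}
```

The same early-emission idea carries over to the sliding window by replacing the built-in window with a KeyedProcessFunction that manages its own keyed state and timers, which is the processor-function approach referred to above.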


We can conclude that event-driven frameworks, such as Flink and Kafka Streams, are preferred when doing simple ETL operations. Built-in windowing operations reduce the difference between micro-batching and event-driven systems, shifting in favor of Spark Streaming. The type of stateful transformation chosen has a large influence on latency. Event-driven frameworks make it possible to do interval joins and stateful operations with custom triggers and state management, which can lead to significant latency reductions compared to built-in windowing functionality. Structured Streaming also benefits from a customized implementation since we no longer rely on watermarks for output to be sent out. When tight latency SLAs are imposed, the longer tail latencies and larger distribution spread for complex pipelines should be kept in mind.

Fig. 3. Sustainable throughput: performance for different throughput levels for the tumbling window stage. Each marker represents a benchmark run for the corresponding throughput level. The light grey zone marks unsustainable levels of throughput. Structured Streaming has a different scale on the x-axis.

5.2 Workload for Sustainable Throughput Measurement

The sustainable throughput for each of the frameworks is measured for execution of the pipeline up to and including the tumbling window stage. In Fig. 3, peak sustainable throughput is denoted by the beginning of the grey zone. In this zone, either mean CPU went above a threshold or processing could not keep up with the input stream. Dot markers represent benchmark runs at the corresponding throughput levels. The throughput levels bordering this grey zone are often also at risk of becoming unsustainable, since the job does not have any buffer for unforeseen bursts, delays, etc. For unsustainable loads, we show CPU and memory metrics but we do not show latency metrics, since increasing delays make the latency soar to very high levels at which the median has no interpretative value. For most frameworks, an increase in latency signals throughput levels becoming unsustainable. It is important to note that the latency we show here is computed using a multi-broker Kafka cluster and is, therefore, only accurate to a tenth of a second. Hence, we use this latency merely to compare broad trends. Memory usage fluctuations are shown by the minimum, maximum and average values of the run to avoid cluttering the graph. We take a look at the growth in heap usage and the effectiveness of GC cycles to clean up the heap memory as throughput increases.

For Spark Streaming, we increase the batch interval since this increases the peak sustainable throughput significantly. With a batch interval of one second, Spark Streaming was not able to handle a peak sustainable throughput of over 20 000 events per second. With a batch interval of three seconds, a sustainable peak throughput of 26 000 events per second is reached. When we increase the batch interval to five seconds, it increases to 30 000 events per second, which confirms that sustainable throughput can be increased by increasing the batch interval, but this comes with an inherent latency cost. We notice that for the runs with low load the latency is around 300 ms longer than half of the batch interval, implying that the processing of the batch took approximately 300 ms. As throughput increases, the latency steadily increases as well due to the increased processing time for larger batches. Jobs start incurring delays when the processing time becomes larger than the batch interval. At this point, the latency approaches 1.5 times the batch interval, since this includes the buffering at the receiver side. Structured Streaming, on the other hand, uses a dynamic batch interval to improve the data processing speed. When the throughput increases, the batch interval and latencies increase as well. This mechanism allows Structured Streaming to handle a throughput of 115 000 events per second with a CPU utilization of 50 percent. When the throughput increases any further, the batch interval passes the 10 second median latency threshold we imposed, but still remains sustainable. At this point, we are confronted with a trade-off between latency and throughput. As stated in [6], Structured Streaming has been optimized for throughput. For latency-sensitive use cases, they recommend using the continuous processing mode, which is still experimental. Since we focus on benchmarking real-time processing systems, we do not allow latency to go above 10 seconds in this workload. Following this definition, Structured Streaming has a peak sustainable throughput of 115 000 events per second while maintaining a median latency under 10 seconds.

The pattern of increasing latencies for increasing throughput is not apparent for Flink and Kafka Streams. The main reason is that Flink and Kafka Streams do event-driven processing with very limited buffering on the receiver side. When the constant-rate throughput is at such a level that it causes backpressure on the receiver side, the throughput is already at unsustainable levels. Flink can sustain over 30 000 events per second for a buffer timeout of 100 ms. At this point, CPU levels are over 95 percent. In order to avoid mean CPU utilization above a threshold of 80 percent, we set the peak sustainable throughput at 26 000 events per second. As an experiment we set the buffer timeout to -1, which means that buffers are only flushed when they are full. However, this did not increase throughput further. This is in agreement with the paper presenting Flink [12], which shows a large throughput increase when increasing the timeout from 0 to 50 ms but a limited throughput increase when increasing from 50 ms to 100 ms. When using the custom trigger implementation, the peak sustainable throughput decreased slightly to 25 000 events per second, as can be consulted in the Supplemental File, available online.
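The throughput knobs discussed in this subsection map onto a handful of configuration calls. The sketch below illustrates where they live in each API; the application name, the local master and the concrete values are illustrative placeholders rather than the exact benchmark settings.

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ThroughputTuningSketch {
  def main(args: Array[String]): Unit = {
    // Spark Streaming: the micro-batch interval is fixed when the context is created.
    // Raising it from one to three or five seconds increases the peak sustainable
    // throughput, but every record then waits longer before its batch is scheduled.
    val sparkConf = new SparkConf().setAppName("tuning-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Flink: the network buffer timeout bounds how long an output buffer may be held
    // back before it is flushed downstream; -1 flushes only full buffers.
    val flinkEnv = StreamExecutionEnvironment.getExecutionEnvironment
    flinkEnv.setBufferTimeout(100) // milliseconds

    // Structured Streaming configures this on the query instead, e.g.
    //   df.writeStream.trigger(Trigger.ProcessingTime(0L))     // micro-batch, as fast as possible
    //   df.writeStream.trigger(Trigger.Continuous("1 second")) // experimental continuous mode
  }
}
```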


The custom trigger implementation is slightly more CPU intensive than the default implementation and CPU is the bottleneck to increasing throughput further.

Kafka Streams DSL shows relatively constant latencies of around 700 ms for all sustainable throughput levels. At a throughput level of 18 500 events per second, average CPU levels reach 83 percent. When the throughput increases further, processing starts lagging behind on the input stream and latencies soar. When using the processor API to implement the tumbling window stage, there was a slight increase in peak sustainable throughput to 20 000 events per second, as can be consulted in the Supplemental File, available online.
memory usage due to the larger input and larger state. At
sustainable levels of throughput, Flink is the only framework
where we do not notice an increase in average heap memory at around 2,700 events for another eight seconds afterwards.
usage when throughput increases. We see memory ramp up Flink is able to catch up after 81 seconds by publishing
to approximately 9 GB. It then drops back to around 100 MB smaller batches over a longer period of time. Kafka Streams
after it is collected. The maximum heap used only starts requires most time to process all data with 99 seconds and
increasing after throughput reaches unsustainable levels. also publishes smaller batches over a longer period of time.
As visualized in Table 2, our results confirm that Struc- To make the built-in window implementation of Kafka
tured Streaming can reach higher throughput due to micro- Streams process the data in order, the allowed maximum
batch execution optimization. Spark Streaming also shows a task idle time had to be raised as described in Section 4.2.
substantial peak throughput increase when the micro-batch During the processing of the burst, all workers of Flink,
interval is raised. We, however, make the large latency cost Kafka Streams and Spark Streaming have 100 percent CPU
that comes with this explicit. For use cases with tighter utilization. The CPU utilization of Spark Streaming drops
latency requirements, Flink obtains the best results. immediately before the batches are published onto Kafka.
For Structured Streaming, we do not reach 100 percent CPU
5.3 Workload With Burst at Startup for the initial minutes of processing and we notice one
We executed the single burst workload for all frameworks worker having increased CPU utilization compared to the
and all pipeline complexities and implementations. Since we other workers of the cluster. Similar to other workloads,
notice similar behavior for all complexities, we will focus on Flink uses up to 9 GB of the heap while Spark Streaming
the results for the tumbling window stage. Additional results and Kafka Streams stay around 3 GB.
have been put in the Supplemental File in Section 5.3, avail- When executing the workload until other stages or for dif-
able online. In Fig. 4, the behavior of the framework during ferent implementations, we notice similar patterns as can be
the first minutes after startup is shown. The light grey zone consulted in the Supplemental File, available online. The cus-
shows where the framework caught up to the incoming tomized implementation finishes faster for Kafka Streams
stream. In the first row of Fig. 4, we see the output through- and Structured Streaming and reaches a higher peak through-
put with clear differences across frameworks that are mainly put. The peak throughput for Structured Streaming goes up
due to a micro-batch or event-driven approach. Kafka to even 229 000 events per second.
Streams and Flink publish the first output 16 seconds after In conclusion, micro-batch frameworks take in a big load of
startup. Structured Streaming takes longer with 44 seconds. data on start-up and take a while to give initial outputs. Event-
Spark Streaming requires more than a minute of processing driven frameworks take in smaller loads of data and start
time before the first output is published. However, these ini- delivering output sooner. Structured Streaming consistently
tial batches for Spark Streaming, contain around 80 000–110 catches up the fastest, followed by Flink, Spark Streaming and
000 events in six subsequent batches. Structured Streaming finally Kafka Streams.
shows batch sizes going up to 150 000 events for the first five
seconds and manages to catch up the fastest after 53 seconds. 5.4 Workload With Periodic Bursts
Due to its adaptive micro-batch approach it can process large We ran the periodic burst workload for several pipeline
batches of data well and can quickly catch up with the input complexities. The results for the tumbling window stage are
data, validating the statements made by the developers of shown in Fig. 5 and for the join stage in Fig. 6. Additional
Structured Streaming in [6]. Spark Streaming also uses a results have been put in the Supplemental File in Section 5.4,
micro-batch approach, causing a similar large burst in output available online. We zoom in on a one-minute-interval near-
as observed with Structured Streaming. After 76 seconds ing the end of our run to get a clearer view of what happens
most of the data is processed, however batches stay elevated after a burst of data is ingested. Similar to the workload for
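The task-idle setting mentioned above for Kafka Streams is a single configuration entry. The snippet below shows where it is set; the application id, broker address, topics and the ten-second value are placeholders rather than the exact benchmark configuration.

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object SingleBurstConfigSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "single-burst-tumbling-window")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
    // Let a task wait for records on all of its input partitions before picking the
    // next record, so a lagging partition is not processed out of order during the burst.
    props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, "10000")

    val builder = new StreamsBuilder()
    builder.stream[String, String]("input-topic").to("output-topic") // placeholder topology
    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
  }
}
```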


5.4 Workload With Periodic Bursts

We ran the periodic burst workload for several pipeline complexities. The results for the tumbling window stage are shown in Fig. 5 and for the join stage in Fig. 6. Additional results have been put in the Supplemental File in Section 5.4, available online. We zoom in on a one-minute interval nearing the end of our run to get a clearer view of what happens after a burst of data is ingested. Similar to the workload for sustainable throughput, latency measurements are only accurate to a tenth of a second.

Fig. 5. Periodic burst workload until the tumbling window stage: performance metrics for the last minute of the run.

Fig. 6. Periodic burst workload until the join stage: performance metrics for the last minute of the run.

Fig. 7. Periodic burst workload for customized tumbling window implementations for Flink, Kafka Streams, and Structured Streaming: latency metrics for one minute of the run.

When we look at the input throughput, we can clearly discriminate the periodic bursts of around 38 000 messages every ten seconds. In between bursts, the input throughput stays at 400 messages per second. For the tumbling window stage, each output event requires on average 3.17 input events. For the join stage, two input events are required. This explains the discrepancy in messages per second between the input and output throughput. The differences in the height of the output throughput rate between frameworks can be explained by the fact that some frameworks ingest the burst at once while others ingest it in several chunks and output the results more gradually.

In Fig. 5, the relationships between the different performance metrics reveal some interesting patterns. For Spark Streaming, we see that after a burst of the input throughput, CPU levels rise and are followed by a burst in the output throughput. The latency increases for two to three seconds after the burst, which can be noted by the width of the peak. The median latency during the low-load periods is around 700 ms. However, in the first seconds after a burst, events have an inflated median latency of 1,370 ms, then it goes up to 2,050 ms, after which it falls back to 1,350 ms again for the final batch. The main reason for this is the increased processing time required to do all the computations for this burst of data. Besides, the events that come in right after the burst incur a scheduling delay due to this increased processing time and, therefore, have inflated latencies as well.

In Fig. 5 we also show the performance of the default trigger implementation for Flink and the DSL implementation for Kafka Streams. We see that the impact of bursts on the latency is the least for Flink, with the median latency rising from around 800 ms to 1,250 ms and the p99 latency staying fairly constant and under 1,500 ms. Due to the lower impact of bursts on the processing time of Flink, the period for which latency remains at higher levels is also shorter. When we execute this workload for earlier stages, such as the parse and join stage (Fig. 6), we see a larger effect on the p50 and p99 latency. This can be explained by the fact that the tumbling window applies buffering which dampens latency differences across events. When using custom triggers, this buffering is reduced, as can be seen in Fig. 7. We now see clear reactions to bursts again and a very low latency in between bursts. Furthermore, the p99 latency after a burst is 1,200 ms, which is still lower than the almost constant latency of the default trigger implementation. When we use the low-level API of Kafka Streams to implement this stage, the latency of processing a burst is similar to that of Flink and much lower than with the DSL implementation. With the DSL implementation, the median and p99 latency rise considerably after a burst, up to 2,800 ms. For the parsing stage, the latency of Kafka Streams is even slightly lower than that of Flink. In contrast, we can see in Fig. 5 that the latency of Structured Streaming exhibits a very irregular course, with the median fluctuating between 3,000 ms and 4,500 ms and no clear reactions to bursts. This is due to the delays in watermark propagation that have been explained in Section 5.1. When we use the customized implementation that does not use watermarks (Fig. 7), the median latency remains under two seconds. Structured Streaming can sustain the largest throughput due to its dynamic micro-batch approach and it also benefits from that when processing bursts. For the join stage and the parsing stages, we see lower and less variable latencies and no clear reactions to bursts. Finally, we see that, particularly for Kafka Streams and Spark Streaming, the distribution of the latency becomes much narrower immediately after a burst, with median and p99 latencies almost converging.
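The watermark-free customized implementation for Structured Streaming referred to here boils down to keyed state that is flushed on a processing-time timeout. The sketch below captures that idea only; the Measurement and WindowResult types, the rate source and the one-second timeout are assumptions for illustration, not the benchmark's actual code.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical record and result types for the example.
case class Measurement(measurementId: String, timestamp: Timestamp, value: Double)
case class WindowResult(measurementId: String, count: Long, sum: Double)

object ProcessingTimeWindowSketch {

  // State is keyed by measurement id, updated as events arrive, and emitted when a
  // processing-time timeout fires instead of waiting for watermark propagation.
  def updateGroup(key: String,
                  events: Iterator[Measurement],
                  state: GroupState[WindowResult]): Iterator[WindowResult] = {
    if (state.hasTimedOut) {
      val result = state.get
      state.remove()
      Iterator.single(result)
    } else {
      val previous = state.getOption.getOrElse(WindowResult(key, 0L, 0.0))
      val updated = events.foldLeft(previous) { (acc, m) =>
        WindowResult(key, acc.count + 1, acc.sum + m.value)
      }
      state.update(updated)
      state.setTimeoutDuration("1 second")
      Iterator.empty
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("processing-time-window-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the parsed Kafka stream of the benchmark pipeline.
    val measurements = spark.readStream
      .format("rate").load()
      .selectExpr("CAST(value AS STRING) AS measurementId", "timestamp", "CAST(value AS DOUBLE) AS value")
      .as[Measurement]

    val results = measurements
      .groupByKey(_.measurementId)
      .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout)(updateGroup)

    results.writeStream.outputMode(OutputMode.Append()).format("console").start().awaitTermination()
  }
}
```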


For CPU utilization, we see that for most frameworks all workers are equally utilized when processing bursts, with almost all lines in Fig. 5 coinciding. During the processing of a burst we see CPU usage of 60-90 percent for all workers, while the CPU utilization throughout the rest of the run remains around 2-4 percent. Structured Streaming shows different behavior, with an average load of 6-10 percent and peaks between 20-40 percent mainly contributed by one worker. This worker had almost constantly higher CPU usage than the other workers in the cluster.

When looking at the heap usage throughout this run, we see that Structured Streaming uses the least memory. For Flink we see memory usage ramp up to 10 GB before a GC is triggered. These GC cycles are synchronized with the frequency of the bursts. After a burst has been processed, GC is initiated. For the other frameworks this behavior is not as explicit. The amount of memory that is used is not necessarily important for framework performance. What is important is that GCs stay effective and that memory frees up well after each GC, which is the case for all frameworks.

As a conclusion, we can state that using custom implementations can reduce not only the latency of the pipeline in general but also the latency of processing bursts. When we use only high-level APIs, Structured Streaming shows much resilience against bursts for the parsing and joining pipelines and also uses the least resources. Structured Streaming benefits greatly from its micro-batch execution optimizations when processing bursts, confirming [6]. Due to its dynamic micro-batch interval, it reaches much lower latencies than Spark Streaming for these earlier stages. When the pipeline gets more complex, Flink suffers the least from processing bursty data, followed by Spark Streaming. Detailed visualizations of the other workloads have been put in the Supplemental File, available online.

6 CONCLUSION

In this paper we have presented an open-source benchmark implementation and a full analysis of two popular stream processing frameworks, Flink and Spark Streaming, and two emerging frameworks, Structured Streaming and Kafka Streams. Four workloads were designed to cover different processing scenarios and ensure correct measurements of all metrics. For each workload, we offer insights on different processing pipelines of increasing complexity and on the parameters that should be tuned for the specific scenario. Our latency workload provides correct measurements by capturing time metrics on a single machine, i.e., the Kafka broker. Furthermore, dedicated workloads were defined to measure peak sustainable throughput and the capacity of the frameworks to overcome bursts or catch up on delay. We discuss the metrics latency, throughput and resource consumption and the trade-offs they offer for each workload. By including built-in as well as customized implementations where possible, we aim to show the advantages of having a flexible API at your disposal.

A concise overview of the results is shown in Table 2. For simple stateless operations such as ingest and parse, we conclude that Flink processes with the lowest latency. For these tasks, the newer Structured Streaming API of Spark shows potential with lower median latencies than Spark Streaming. However, it suffers from longer tail latencies, making it less predictable to reach SLAs with tight upper bounds. For the joining stage, event-driven frameworks such as Flink and Kafka Streams have an advantage due to the interval join capability. This type of join can output events with a much lower latency compared to the tumbling window join available in micro-batch frameworks. The latency differences between event-driven and micro-batch systems disappear when the pipeline is extended with built-in tumbling and sliding windows. For these operations, Spark Streaming shows lower tail latencies and outperforms the other frameworks. However, by using the flexibility frameworks offer for these stateful operations through custom triggers and low-level processing capabilities, pipelines can be optimized to reach lower latency.

A second important metric is throughput. Structured Streaming is able to sustain the highest throughput, but this comes at significant latency costs. If latency and throughput are both equally important, Flink becomes the most interesting option. When processing a single large burst at startup, Flink outputs the first events the soonest while Structured Streaming finishes processing the entire burst the earliest. For use cases where the input stream exhibits occasional bursts, the least performance impact is perceived when using Flink as the processing framework. When using the default implementation, the p99 latency of Flink stays constant throughout bursts, while for other frameworks significant increases are noted. The latency of processing bursts can again be optimized by using low-level APIs.

Finally, the results of this paper can be used as a guideline when choosing the right tool for a processing job by not only focusing on an inter-framework comparison but also highlighting the effects of different pipeline complexities, implementations and data characteristics. When choosing a framework for a use case, the decision should be made based on the most important requirements. If subsecond latency is critical, even in the case of bursts, it is advised to use highly optimized implementations with an event-driven framework such as Flink. When subsecond latency is not required, Structured Streaming offers a high-level, intuitive API that gives very high throughput with minimal tuning. It is important to keep in mind that at the time of this writing parts of the Structured Streaming API were still experimental and not all pipelines using built-in operations were supported yet. For those use cases, Spark Streaming with a high micro-batch interval offers a good alternative. For ETL jobs on data residing on a Kafka cluster that do not require very high throughput and have reasonable latency requirements, Kafka Streams offers an interesting alternative to Flink, with the additional advantage that it does not require a pre-installed cluster.

7 LIMITATIONS AND FURTHER RESEARCH

This benchmark could be extended with joins with static datasets, and the stateful operators should be analyzed under different window lengths. Other frameworks, such as Apex, Beam and Storm, could be included as well. Finally, the fault tolerance and scalability of the frameworks under these workloads are interesting directions for future work.

ACKNOWLEDGMENTS


This research was done in close collaboration with Klarrio, a cloud native integrator and software house specialized in bidirectional ingest and streaming frameworks aimed at IoT & Big Data/Analytics project implementations. For more information please visit https://klarrio.com.

REFERENCES

[1] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl, “Benchmarking distributed stream processing engines,” in Proc. IEEE 34th Int. Conf. Data Eng., 2018.
[2] Z. Karakaya, A. Yazici, and M. Alayyoub, “A comparison of stream processing frameworks,” in Proc. Int. Conf. Comput. Appl., 2017, pp. 1–12.
[3] S. Chintapalli et al., “Benchmarking streaming computation engines: Storm, Flink and Spark Streaming,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2016, pp. 1789–1792.
[4] R. Lu, G. Wu, B. Xie, and J. Hu, “Stream bench: Towards benchmarking modern distributed stream computing frameworks,” in Proc. IEEE/ACM 7th Int. Conf. Utility Cloud Comput., 2014, pp. 69–78.
[5] S. Qian, G. Wu, J. Huang, and T. Das, “Benchmarking modern distributed streaming platforms,” in Proc. IEEE Int. Conf. Ind. Technol., 2016, pp. 592–598.
[6] M. Armbrust et al., “Structured streaming: A declarative API for real-time applications in Apache Spark,” in Proc. Int. Conf. Manage. Data, 2018, pp. 601–613.
[7] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: Fault-tolerant streaming computation at scale,” in Proc. 24th ACM Symp. Operating Syst. Princ., 2013, pp. 423–438.
[8] A. Shukla and Y. Simmhan, “Benchmarking distributed stream processing platforms for IoT applications,” in Proc. Technol. Conf. Perform. Eval. Benchmarking, 2016, pp. 90–106.
[9] A. Shukla, S. Chaturvedi, and Y. Simmhan, “RIoTBench: An IoT
benchmark for distributed stream processing systems,” Concur-
rency Comput. Pract. Experience, vol. 29, no. 21, 2017, Art. no. e4257.
[10] G. van Dongen, B. Steurtewagen, and D. Van den Poel, “Latency
measurement of fine-grained operations in benchmarking distrib-
uted stream processing frameworks,” in Proc. IEEE Int. Congress
Big Data, 2018, pp. 247–250.
[11] J. Kreps et al., “Kafka: A distributed messaging system for log
processing,” in Proc. NetDB, 2011, pp. 1–7.
[12] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and
K. Tzoumas, “Apache flink: Stream and batch processing in a sin-
gle engine,” Bulletin IEEE Comput. Soc. Tech. Committee Data Eng.,
vol. 36, no. 4, pp. 28–38, 2015.
[13] M. J. Sax, G. Wang, M. Weidlich, and J.-C. Freytag, “Streams and
tables: Two sides of the same coin,” in Proc. Int. Workshop Real-
Time Bus. Intell. Analytics, 2018, Art. no. 1.
[14] Flink Documentation, 2019, Accessed: Dec. 19, 2019. [Online].
Available: https://ci.apache.org/projects/flink/flink-docs-stable/
[15] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas,
“State management in Apache Flink®: Consistent stateful distributed
stream processing,” Proc. VLDB Endowment, vol. 10, no. 12,
pp. 1718–1729, 2017.
[16] Kafka Streams Documentation, 2019, Accessed: Dec. 19, 2019.
[Online]. Available: https://kafka.apache.org/documentation/
streams/
[17] Apache Spark: Spark programming guide, 2019, Accessed: Dec. 19,
2019. [Online]. Available: https://spark.apache.org/docs/
latest/streaming-programming-guide.html
[18] Structured Streaming, 2019, Accessed: Dec. 19, 2019. [Online]. Available:
https://spark.apache.org/docs/latest/structured-streaming-
programming-guide.html
[19] cAdvisor, 2019, Accessed: Jun. 04, 2019. [Online]. Available:
https://github.com/google/cadvisor
[20] M. Caporaloni and R. Ambrosini, “How closely can a personal
computer clock track the UTC timescale via the Internet?” Eur. J.
Phys., vol. 23, no. 4, pp. L17–L21, 2002.
[21] Kafka Improvement Proposals: KIP-32 - Add timestamps to Kafka
message, 2019, Accessed Jun. 04, 2019. [Online]. Available:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-32
+-+Add+timestamps+to+Kafka+message
[22] Y. Byzek, “Optimizing your apache kafka deployment,” 2017,
Accessed: Dec. 18, 2019. [Online]. Available: https://www.
confluent.io/wp-content/uploads/Optimizing-Your-Apache-
Kafka-Deployment-1.pdf

Giselle van Dongen (Member, IEEE) is working toward the PhD degree at Ghent University, Ghent, Belgium, teaching and benchmarking real-time distributed processing systems such as Spark Streaming, Flink, Kafka Streams, and Storm. Concurrently, she is lead data scientist at Klarrio, specialising in real-time data analysis, processing and visualisation.

Dirk Van den Poel (Senior Member, IEEE) received the PhD degree. He is currently a senior full professor of Data Analytics/Big Data at Ghent University, Belgium. He teaches courses such as Big Data, Databases, Social Media and Web Analytics, Analytical Customer Relationship Management, Advanced Predictive Analytics, and Predictive and Prescriptive Analytics. He co-founded the advanced master of science in marketing analysis, the first (predictive) analytics master program in the world, as well as the master of science in statistical data analysis and the master of science in business engineering/data analytics.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.