Evaluation of Stream Processing Frameworks
Abstract—The increasing need for real-time insights in data sparked the development of multiple stream processing frameworks. Several
benchmarking studies were conducted in an effort to form guidelines for identifying the most appropriate framework for a use case. In this
article, we extend this research and present the results gathered. In addition to Spark Streaming and Flink, we also include the emerging
frameworks Structured Streaming and Kafka Streams. We define four workloads with custom parameter tuning. Each of these is optimized
for a certain metric or for measuring performance under specific scenarios such as bursty workloads. We analyze the relationship between
latency, throughput, and resource consumption and we measure the performance impact of adding different common operations to the
pipeline. To ensure correct latency measurements, we use a single Kafka broker. Our results show that the latency disadvantages of using
a micro-batch system are most apparent for stateless operations. With more complex pipelines, customized implementations can give
event-driven frameworks a large latency advantage. Due to its micro-batch architecture, Structured Streaming can handle very high
throughput at the cost of high latency. Under tight latency SLAs, Flink sustains the highest throughput. Additionally, Flink shows the least
performance degradation when confronted with periodic bursts of data. When a burst of data needs to be processed right after startup,
however, micro-batch systems catch up faster while event-driven systems output the first events sooner.
Index Terms—Apache spark, structured streaming, apache flink, apache kafka, kafka streams, distributed computing, stream processing
frameworks, benchmarking, big data
1 INTRODUCTION
IN RESPONSE to the increasing need for fast, reliable and accurate answers to data questions, many stream processing frameworks were developed. Each use case, however, has its own performance requirements and data characteristics. Several benchmarks were conducted in an effort to shed light on which frameworks perform best in which scenario, e.g., [1], [2], [3], [4], [5]. Previous work often benchmarked Flink and Spark Streaming. We extend this work by comparing later releases of these frameworks with two emerging frameworks: Structured Streaming, which has shown promising results [6], and Kafka Streams, which has been adopted by large IT companies such as Zalando and Pinterest. We take a two-pronged approach to comparing stream processing frameworks on different workloads and scenarios. In the first place, we assess the performance of different frameworks on similar operations. We look at relationships between metrics such as latency, throughput and resource utilization for different processing scenarios. The parameter configuration of a framework has a large influence on its performance in a certain processing scenario. Therefore, we tune parameters per workload. Additionally, we evaluate the performance impact of adding different, complex operations to the workflow of each framework. Understanding the impact of an operation on the performance of a pipeline is a valuable insight in the design phase of the processing flow. By combining these two approaches, we can form a more nuanced evaluation of which framework is most suitable for which use case. In short, we make the following contributions:

1) Open-source benchmark and extensive analysis of emerging frameworks, Structured Streaming and Kafka Streams, besides newer releases of Flink and Spark Streaming.
2) Accurate latency measurement of pipelines of different complexities for all frameworks by capturing time on a single Kafka broker.
3) Thorough analysis of relationships between latency, throughput and resource consumption for each of the frameworks and under different processing pipelines, workloads and throughput levels.
4) Dedicated parameter tuning per workload for more nuanced performance evaluations with detailed reasoning behind each of the settings.
5) Inclusion of built-in and customized stateful operators.
6) Guidelines for choosing the right tool for a use case by taking into account inter-framework differences, different pipeline complexities, implementation flexibility and data characteristics.

The code for this benchmark can be found at https://github.com/Klarrio/open-stream-processing-benchmark.

The rest of this paper is organized as follows. The next section gives an overview of the related work that formed a basis for this study. In Section 3 we describe the setup that was used to conduct this benchmark. Section 4 dives deeper into the configurations used for the different frameworks. A discussion of the results follows in Section 5, followed by general conclusions. Finally, we outline some of the limitations and opportunities for further research. Additional information on how this benchmark was conducted can be found in the Supplemental File, available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2020.2978480.
2 RELATED WORK
Several benchmarking studies have been conducted over the last years. In this section, we outline the most important differences. A tabular overview is given in Section 1 of the Supplemental File, available online. Most of these benchmarks were implemented on one or a combination of the following frameworks: Spark Streaming, Storm and Flink. In [4] and [5], Spark Streaming outperforms Storm in peak throughput measurements and resource utilization, but Storm attains lower latencies. The Yahoo benchmark [3] confirmed that Storm struggles under high throughput. Karakaya et al. [2] extended the Yahoo benchmark and found that Flink 0.10.1 outperforms Spark Streaming 1.5.0 and Storm 0.10.0 for data-intensive applications. However, in applications where high latency is acceptable, Spark outperforms Flink. Karimov et al. [1] confirmed that the general performance of Spark Streaming 2.0.1 and Flink 1.1.3 is better than that of Storm. In this work, we include the latest versions of Spark Streaming 2.4.1 and Flink 1.9.1 due to their superior results. Additionally, we include Spark's new Structured Streaming 2.4.1 API and Kafka Streams 2.1.0, which have not been thoroughly benchmarked before. Micro-batch systems as well as event-driven frameworks are represented in this benchmark. Both Spark frameworks are micro-batch systems [7], meaning that they buffer events on the receiver side and process them in small batches. Kafka Streams and Flink work event-driven and only apply buffering when events need to be sent over the network.

Previous work mainly focused on benchmarking end-to-end applications in areas such as ETL, descriptive statistics and IO, e.g., [2], [3], [8], [9]. Some work has been conducted on benchmarking single operations such as parsing [8], filtering [4], [8], [9] and windowed aggregations [1]. The metrics for these single operations were, however, recorded as if they were end-to-end applications, by including ingesting, buffering, networking, serialization, etc. In this work, we also measure the impact of adding a certain operation to a pipeline. We ensured that each of these operations constitutes a separate stage in the DAG of the job. We based the design of our pipeline on the Yahoo benchmark [3], which has also been adopted by [2]. Besides a tumbling window, we also include a sliding window for the latency measurement workload. For these types of stateful transformations, some frameworks offer more flexibility via low-level APIs, custom triggers and operator access to managed state. In this work we implemented the tumbling and sliding window with built-in as well as low-level customized operators for Flink, Kafka Streams and Structured Streaming to investigate the advantages of using this flexibility.

Two of the most important performance indicators in stream processing systems are latency and throughput, which have been included in most previous benchmarks, e.g., [1], [3], [8], [9]. We use the same methodology to measure the latency of single operations as in the initial proposal for this study [10]. The ability of a framework to process events faster and in bigger quantities with a similar setup leads to cost reductions and a greater ability to overcome bursts of data. Most research, however, merely analyses the maximum throughput that can be sustained for a prolonged period of time, e.g., [1]. In this benchmark, we analyze throughput in two additional scenarios. First, we investigate the behavior of the framework under a bursty workload. Second, we monitor the behavior of the framework when it needs to process a big burst of data right after startup. Some papers already studied the behavior of stream processing frameworks under bursty data. Shukla and Simmhan [8], and later RIoTBench [9], studied the behavior of Storm under bi-modal data rates with peaks in mornings and evenings, while Karimov et al. [1] studied the effects of a fluctuating workload for Spark, Flink and Storm. Spark and Flink showed similar behavior in the case of windowed aggregations, while Flink handled windowed joins better than Spark. Storm was the most susceptible to bottlenecks due to data bursts. Most other literature studied constant rate workloads under different data scales, e.g., [2], [3], [4], [5]. To be able to generate these different loads, we use temporal and spatial scaling similar to [8] and [9].

StreamBench [4] and Qian et al. [5] studied the effects of a failing node on throughput and latency. Their results indicated no significant impact for Spark Streaming, while Storm experienced a large performance decrease. In this work, in contrast, we investigate the behavior of processing a big burst of data at startup, mimicking the behavior triggered when the job would be down for a period of time.

This benchmark uses Apache Kafka [11] as a distributed publish-subscribe messaging system to decouple input generators from processors. In the Kafka cluster, streams of data are made available on partitioned topics. The majority of past benchmarking literature uses Apache Kafka for this purpose as this is representative of real-world usage, e.g., [2], [3], [8], [9], [10]. Furthermore, frameworks, e.g., [6], [12], require a replayable, durable data source such as Kafka to give exactly-once processing guarantees.

Finally, the configuration parameters of a framework have large effects on latency and throughput. Previous research often did not sufficiently document the configuration settings or used default values for all workloads. In this work, framework parameters are tuned separately for each workload to get a more accurate view of the performance of a framework on a certain metric or in a certain scenario. A thorough discussion follows in Section 4, supported by extra material in the Supplemental File, available online.

3 BENCHMARK DESIGN
In this section, we describe the benchmark setup. First, we introduce the input data source. Next, we elaborate on the operations that were included and the metrics that are monitored. Finally, we describe the different workloads and the architecture on which the processing jobs are run.

3.1 Data
We perform the benchmark on data from the IoT domain because of its increasing popularity and its increasing amount of use cases. We use traffic sensor data from the Netherlands provided by NDW (Nationale Databank Wegverkeersgegevens). This data contains measurements of the number of cars (flow) and their average speed at around
monitoring systems (e.g., application logs), preventive maintenance workloads or connected devices (e.g., solar panels).

We determine the peak sustainable throughput by running the workload repeatedly for increasing data scales and monitoring latency, throughput and CPU. For latency, we check whether the median latency does not increase monotonically throughout the run and remains below 10 seconds. When the volume of data is higher than the framework can process, latency increases constantly throughout the workload due to processing delays. Furthermore, we also check whether the framework has processed all events by the time the stream ends. In this work, we assume that once the input stream halts, the framework needs to be able to finish processing in less than ten seconds, otherwise it was either lagging behind on the input data stream or batching at intervals higher than 10 seconds. Finally, we monitor whether the average CPU utilization of the framework containers does not exceed 80 percent.
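For illustration, these sustainability criteria can be expressed as a simple post-hoc check over the collected metric time series. The sketch below is not part of the published benchmark harness; the case-class fields and the helper name are assumptions introduced only to make the criteria concrete.

```scala
// Illustrative sketch (not the benchmark's code): deciding whether a run at a
// given data scale was "sustainable" according to the criteria described above.
case class RunMetrics(
  medianLatenciesMs: Seq[Double],      // median latency per reporting interval
  drainTimeAfterStopMs: Long,          // time to finish processing once the input halts
  avgCpuUtilPerContainer: Seq[Double]  // average CPU utilization per container (0-100)
)

def isSustainable(run: RunMetrics): Boolean = {
  // 1) Median latency must stay below 10 seconds ...
  val belowBound = run.medianLatenciesMs.forall(_ < 10000.0)
  // ... and must not increase monotonically throughout the run.
  val monotonicallyIncreasing =
    run.medianLatenciesMs.sliding(2).forall(w => w.size < 2 || w(1) > w(0))
  // 2) All events must be processed within ten seconds after the stream ends.
  val drainsInTime = run.drainTimeAfterStopMs < 10000L
  // 3) Average CPU utilization of the framework containers must not exceed 80 percent.
  val cpuOk = run.avgCpuUtilPerContainer.forall(_ <= 80.0)
  belowBound && !monotonicallyIncreasing && drainsInTime && cpuOk
}
```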
3.4.3 Workload With Burst at Startup
The third workload measures the peak burst throughput that a framework can handle right after startup. This mimics the effect of a job trying to catch up with an incurred delay. For this, we start up the data publisher before the job has been started and let it publish around 6,000 events per second, which is a throughput level that all frameworks are able to handle but still imposes a considerable load. Five minutes later, we bring up the processing job and start processing the data on the Kafka topics from the earliest offset. This equals around 1 800 000 events or 350 MB, with linearly increasing event times. Once the processing job is running, we reduce the throughput to around 400 messages per second to let event time progress but not put any more load on the framework. We let this processing job run for ten minutes. Afterwards, we evaluate the throughput at which the framework was able to process the initial burst of data.

3.4.4 Workload With Periodic Bursts
Finally, we want to monitor the ability of the framework to overcome bursts in the input stream. This mimics use cases such as processing data from a group of coordinated sensors or from connected devices that upload bursts of data every few hours, or use cases which need to be able to sustain peaks of data such as web log processing. We have a constant stream of a very low volume, and every ten seconds there is a large burst of data. Each burst contains approximately 38 000 events, which is approximately 7.5 MB. The publishing of one burst takes around 170 to 180 ms. We look at how long latency persists at an inflated level after a large burst and at the effects on CPU and memory utilization.

3.5 Architecture
To make our benchmark architecture mimic a real-world production analytics stack, we use AWS EC2. We set up a CloudFormation stack which runs DC/OS on nine m5.2xlarge instances (one master, nine slaves). Each of these instances has 8 vCPU, 32 GB RAM, a network bandwidth of up to 10 Gbps and a dedicated EBS bandwidth of up to 3500 Mbps. Each instance has an EBS volume attached to it with 150 GB of disk space. We use DC/OS as an abstraction layer between the benchmark components and the EC2 instances on which they run. All benchmark components run in Docker containers on DC/OS. To mitigate the potential performance differences between EC2 instances, we run each benchmark five times on different clusters and select the run with median performance.

As a message broker, we use Apache Kafka. For most runs, the cluster consists of five brokers (2 vCPU, 6 GB RAM), which is the same number as the number of workers for each of the frameworks. As described in Section 3.4.1, we use a single Kafka broker (6.5 vCPU, 24 GB RAM) for the latency measurement workload. We use Kafka Manager for managing the cluster and creating the input and output topics, each of these with 20 partitions since each framework cluster has 20 processing cores. Messages are not replicated to avoid creating too much load on the Kafka brokers. The input topics for the speed and flow messages are partitioned on the ID since this is the key for joining and aggregating. Additionally, all brokers are configured to use LogAppendTime for latency measurements. The roundtrip latency of the Kafka cluster is around 1 ms, tested by publishing and directly consuming the message again. We use cAdvisor, backed by InfluxDB, to monitor and store CPU and networking metrics of the framework containers. Concurrently, the processing jobs expose heap usage metrics on an endpoint via JMX, which is then scraped and stored by a JMX collector. An architecture diagram and detailed information on the distribution of resources over all services are listed in Section 3 of the Supplemental File, available online. Without the framework clusters, the DC/OS cluster has 45 percent CPU allocation and 33 percent memory allocation. The framework cluster adds another 28 percent CPU allocation and 35 percent memory allocation.
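Because input and output topics use LogAppendTime and, for the latency workload, live on a single broker, latencies can be derived purely from broker-assigned timestamps, free of clock skew between machines. The sketch below only illustrates this idea; the topic name and the assumption that the output value carries the broker timestamp of the corresponding input record are ours, not taken from the benchmark code.

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

// Illustrative sketch: with LogAppendTime, every record timestamp is assigned by the
// broker clock, so output.timestamp - input.timestamp is a skew-free latency estimate.
val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "latency-observer")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("metrics-output")) // placeholder topic

while (true) {
  val records = consumer.poll(Duration.ofMillis(500)).iterator()
  while (records.hasNext) {
    val record = records.next()
    // Assumption: the processing job copies the broker timestamp of the input record
    // into the output value, so the difference below is the stage's round-trip latency.
    val inputLogAppendTime = record.value().toLong
    val latencyMs = record.timestamp() - inputLogAppendTime
    println(s"latency: $latencyMs ms")
  }
}
```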
4 FRAMEWORKS
In this section, we discuss each of the frameworks and their configurations, as listed in Table 1. We only discuss the configuration parameters for which we do not use the default values. Parameter tuning is done per workload to get a more accurate measurement of performance. Configurations were chosen based on the documentation of the frameworks, expert advice, and an empirical study, which can be further consulted in Section 4 of the Supplemental File, available online.

Some settings are equal for all frameworks. For the frameworks that use a master-slave architecture, i.e., Flink and both Spark frameworks, we deploy a standalone cluster with one master and five workers in Docker containers. Kafka Streams runs with five instances. Each framework gets the same amount of resources for its master (2 vCPU, 8 GB RAM) and slaves (4 vCPU, 20 GB RAM each). Parallelism is set to 20 since we have 20 Kafka partitions and 20 cores per framework.

For event-driven frameworks we use event time as the time characteristic. These frameworks use watermarks to handle out-of-order data [12]. We choose an out-of-order bound of 50 ms since we do not have out-of-order data that needs to be handled, but we do need to take into account the possible time desynchronization between the brokers. Assume that, in the case of a constant rate of data, two events that need to
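In Flink, for example, this 50 ms out-of-order bound translates into a bounded-out-of-orderness timestamp assigner on the source stream. The sketch below is a minimal illustration in the Flink 1.9-era API; the event type and its timestamp field are placeholders, not the benchmark's actual classes.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event type; the benchmark's own case classes differ.
case class SensorEvent(measurementId: String, timestamp: Long, value: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Event time as the time characteristic for the event-driven frameworks.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Bounded out-of-orderness of 50 ms, covering possible clock differences between brokers.
def withWatermarks(stream: DataStream[SensorEvent]): DataStream[SensorEvent] =
  stream.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[SensorEvent](Time.milliseconds(50)) {
      override def extractTimestamp(e: SensorEvent): Long = e.timestamp
    })
```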
Each instance runs on four threads to optimally use all resources. Kafka Streams stores state on Kafka topics and, therefore, does not require an HDFS cluster.

When using the DSL, we need to set the grace period and retention times for the window stages. Since each input record leads to an output update record [13] and only the complete output records are kept, the grace period does not introduce additional latency and has been set to a high value of 5 s. The retention time is a lower-level property that sets the time events should be retained, e.g., for interactive queries, and has no influence on the latency. We set this to be equal to the slide interval plus the grace period.
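As an illustration, a DSL windowed aggregation with these settings could look roughly as follows (Kafka Streams 2.1-era API). The topic and store names, the one-second window and the count operation are placeholders rather than the benchmark's actual pipeline.

```scala
import java.time.Duration
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{Consumed, Grouped, Materialized, TimeWindows}
import org.apache.kafka.streams.state.WindowStore

// Sketch of the DSL window settings discussed above; names and values are illustrative.
val builder = new StreamsBuilder()

val counts = builder
  .stream("speed-events", Consumed.`with`(Serdes.String(), Serdes.Double()))
  .groupByKey(Grouped.`with`(Serdes.String(), Serdes.Double()))
  // 1 s tumbling window; the 5 s grace period does not add latency because every
  // input record still produces an immediate update record downstream.
  .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
  // Retention = slide interval (1 s) + grace period (5 s).
  .count(Materialized
    .as[String, java.lang.Long, WindowStore[Bytes, Array[Byte]]]("speed-counts")
    .withRetention(Duration.ofSeconds(6)))
```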
We do not use message compression for latency measurements. For measuring throughput and resource consumption, we use slightly different parameters, as recommended by [22]. We keep the linger time at the default 100 ms. We use lz4 message compression to reduce the message sizes and the load on the network, thereby increasing throughput. We also increase the size of output batches to reduce the network load.
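These throughput-oriented settings boil down to a handful of producer properties. The values below mirror the description above; the application id, bootstrap address and batch size are assumptions and may differ from the benchmark's exact configuration.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.streams.StreamsConfig

// Throughput/resource-consumption runs: compress and batch harder on the producer side.
// (For the latency runs, compression is left off.)
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ndw-pipeline")   // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
// Keep the Kafka Streams default linger of 100 ms.
props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), "100")
// lz4 compression reduces message sizes and network load, increasing throughput.
props.put(StreamsConfig.producerPrefix(ProducerConfig.COMPRESSION_TYPE_CONFIG), "lz4")
// Larger output batches further reduce the network load (value is an assumption).
props.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), (512 * 1024).toString)
```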
For processing a single burst, we set the maximum task idle time to 300 000 ms. By doing this, Kafka Streams waits an increased amount of time to buffer incoming events on all topics before it starts processing. This prevents the system from running ahead on some topics and subsequently discarding data on other topics if their event time is too far behind.

4.3 Apache Spark: Spark Streaming
Apache Spark [7] is a cluster computing platform that consists of an extended batch API in combination with a library for micro-batch stream processing called Spark Streaming. The job runs with one driver and five executors. On each worker, we allocate 1 GB of the 20 GB to JVM overhead. The executors use 10 percent of the remaining memory for off-heap storage, which leaves 17 GB for heap memory. For the driver we allocate 6 GB of heap memory. Checkpoints are stored in HDFS.

When executing the initial stages of the pipeline, i.e., ingesting and parsing, the jobs are configured to have a micro-batch interval of 200 ms. In the initial stages, the selectivity of the data is one-to-one, so a lower batch interval leads to lower latencies. For the analytics stages we set the micro-batch interval to the same length as the interval at which sensors send their data and on which we will do our aggregations, which is one second. For the sustainable throughput workload, we set the batch interval to three and five seconds since this increases the peak throughput [7]. Spark Streaming splits incoming data into partitions based on the block interval. Therefore, it is recommended to choose the block interval to be equal to the batch interval divided by the desired parallelism [17].

For reading from Kafka we use the direct stream approach with PreferConsistent as the location strategy, meaning the 20 Kafka partitions will be spread evenly across the available executors [17]. It is often recommended to use three times the number of cores as the number of partitions. This, however, only leads to a performance improvement in the case of data skew. When keys are not equally distributed, having more partitions than the number of cores allows more flexibility in spreading work over all executors. In this use case, the data is equally spread over all keys. Hence, using the same number of partitions as the number of cores leads to less overhead and better performance.

In all workloads, Spark Streaming checkpoints at the default interval, which is a multiple of the batch interval that is at least 10 seconds [17]. We use the kryo serializer for serialization and register all classes since it is faster and more compact than the default Java serializer.
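A minimal sketch of these Spark Streaming settings, assuming the spark-streaming-kafka-0-10 connector, is shown below. The application name, topic names and the consumer group are illustrative and not taken from the benchmark code.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Sketch of the Spark Streaming setup described above (values are illustrative).
val conf = new SparkConf()
  .setAppName("ndw-pipeline")
  // kryo is faster and more compact than the default Java serializer.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Block interval = batch interval / desired parallelism (here 200 ms / 20 = 10 ms).
  .set("spark.streaming.blockInterval", "10ms")

// 200 ms micro-batches for the stateless ingest/parse stages;
// 1 s (or 3-5 s for the sustainable-throughput workload) for the analytics stages.
val ssc = new StreamingContext(conf, Milliseconds(200))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming-benchmark",
  "auto.offset.reset" -> "latest")

// Direct stream with PreferConsistent: the 20 Kafka partitions are spread evenly
// over the available executors.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("ndw-flow", "ndw-speed"), kafkaParams))
```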
4.4 Apache Spark: Structured Streaming
In 2016, Apache Spark released a new stream processing API called Structured Streaming [6], which enables users to program streaming applications in the DataFrame API. The DataFrame API is the dominant batch API, and Structured Streaming brings Apache Spark one step closer to unifying its batch and streaming APIs. Structured Streaming offers a micro-batch execution mode and a continuous processing mode for event-driven processing. We use the micro-batch approach because the continuous processing mode is still experimental and does not support stateful operations at the time of this writing. Additionally, the micro-batch API does not support chaining of built-in aggregations; therefore, we do not execute the sliding window stage. However, we also include an implementation of the final two stages using map functions with custom state management, since this circumvents the issues with built-in aggregations.

For Structured Streaming, we trigger job queries as fast as possible, which is enabled by setting the trigger interval to 0 ms. Queries run in the default append mode, which means only new rows of the streaming DataFrame will be written to the sink. Due to the architectural design of Structured Streaming, checkpointing is done for each batch [6]. This leads to a very high load on HDFS. To be able to run the jobs at full capacity, a significant size increase of the HDFS cluster was necessary, since the checkpointing for every micro-batch led to latency increases of seconds for the windowing phases.

The default parallelism parameter arranges parallelism on the RDD level. Structured Streaming, however, makes use of DataFrames. The parallelism of DataFrames after shuffling is set by the SQL shuffle partitions parameter. Since all workloads have small state and since opening connections for storing state is expensive, performance improves significantly when the number of SQL shuffle partitions is reduced to 5. When doing this for Spark Streaming, we did not notice a similar performance improvement. Structured Streaming checkpoints the range of offsets processed after each trigger. In contrast, the default checkpoint interval of Spark Streaming is never less than 10 seconds. Due to the heavier use of checkpointing, Structured Streaming benefits more from reducing the number of SQL shuffle partitions. We use kryo for fast and compact serialization. We use the G1 garbage collector (GC) with the parameters listed in Table 1 to reduce GC pauses. A more thorough explanation of the reasoning behind this can be found in the Supplemental File, available online. Finally, we set the minimum number of batches that has to be retained to two, which further reduces memory consumption and, therefore, GC pauses.
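These settings map onto a query definition roughly as follows. The sketch omits the parsing, joining and aggregation logic, and the paths, topics and checkpoint location shown are assumptions, not the benchmark's exact configuration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Sketch of the Structured Streaming configuration described above.
val spark = SparkSession.builder()
  .appName("ndw-pipeline")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Small state: fewer shuffle partitions than the default 200 avoid opening
  // many state-store connections per micro-batch.
  .config("spark.sql.shuffle.partitions", "5")
  // Keep only a minimal number of batches around to reduce heap pressure.
  .config("spark.sql.streaming.minBatchesToRetain", "2")
  .getOrCreate()

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "ndw-flow,ndw-speed")
  .load()

// ... parsing, joining and aggregating would go here ...

val query = input.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "metrics-output")
  .option("checkpointLocation", "hdfs:///checkpoints/structured-streaming")
  .outputMode("append")                // only new rows are written to the sink
  .trigger(Trigger.ProcessingTime(0L)) // trigger micro-batches as fast as possible
  .start()
```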
TABLE 2
Result Overview for Flink (FL), Kafka Streams (K), Spark Streaming (SP) and Structured Streaming (ST)
Scores can only be interpreted relative to the scores of the other frameworks. Higher scores indicate better performance.
5 RESULTS
In this section, we discuss the results of each of the workloads. First, we discuss the latency and sustainable throughput workloads and afterwards the burst workloads. For all results, we performed multiple runs with similar outcomes, and we discarded the first five minutes of each result time series to filter out warm-up effects. An overview of the results is summarized in Table 2 per workload and for each additional decision factor. These decision factors influence the choice of a framework based on job requirements or pipeline complexity. A higher score indicates better performance. The scores do not have an absolute meaning and should be interpreted relative to the scores of the other frameworks.
that arrived at the beginning of the window. Processing a batch of events, therefore, took between 140 and 170 ms. Furthermore, we see that the variance of the latency increased after adding the join. This is due to the fact that the events are joined using a tumbling window. The p99.999 latency is 200 ms higher than the p99 latency, meaning there are some minor spikes but that, in general, latency stays within predictable ranges. The latencies for Structured Streaming suffered from a much longer tail with the p99 above 1,400 ms, while the median stays at 582 ms.

For the event-driven processing frameworks, Flink and Kafka Streams, the type of join used is an interval join, which can obtain a much lower latency. By joining events that lie within a specified time range of each other, the framework can give an output immediately after both required events arrive and does not wait for an interval to time out. The latency of joining is the lowest for Flink, with a median round-trip latency of 1 ms and 99 percent of the events processed under 3 ms. Kafka Streams reaches median latencies of 2 ms and shows much higher tail latencies, with a p99 of 30 ms and a p99.999 of 218 ms. We always compute the latency of the last event that was required to do the join. If we were to incorporate the latency of the first event, these latencies would be higher, but we would not be expressing processing time.
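An interval join of the two sensor streams in this spirit could be sketched in Flink as follows. The event and result types and the ±50 ms bound are illustrative choices, not the benchmark's own classes or exact bound.

```scala
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

// Hypothetical event and result types; the benchmark's own classes differ.
case class FlowEvent(id: String, timestamp: Long, flow: Int)
case class SpeedEvent(id: String, timestamp: Long, speed: Double)
case class FlowSpeed(id: String, timestamp: Long, flow: Int, speed: Double)

// Interval join: a flow event is matched with speed events of the same key whose
// event time lies within 50 ms of it, and the result is emitted as soon as the
// second event arrives, without waiting for a window to close.
def joinStreams(flows: DataStream[FlowEvent],
                speeds: DataStream[SpeedEvent]): DataStream[FlowSpeed] =
  flows.keyBy(_.id)
    .intervalJoin(speeds.keyBy(_.id))
    .between(Time.milliseconds(-50), Time.milliseconds(50))
    .process(new ProcessJoinFunction[FlowEvent, SpeedEvent, FlowSpeed] {
      override def processElement(flow: FlowEvent,
                                  speed: SpeedEvent,
                                  ctx: ProcessJoinFunction[FlowEvent, SpeedEvent, FlowSpeed]#Context,
                                  out: Collector[FlowSpeed]): Unit =
        out.collect(FlowSpeed(flow.id, math.max(flow.timestamp, speed.timestamp),
          flow.flow, speed.speed))
    })
```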
The last two stages of the processing pipeline are the windowing stages: a tumbling window as well as a sliding window are implemented and chained together. At the time of writing, Structured Streaming did not support chaining built-in aggregations. Therefore, we do not execute the built-in sliding window stage for Structured Streaming. We can circumvent this issue by using map functions with custom state management and processing time semantics, as discussed later. The performance of the built-in tumbling window behind the join led to a large performance degradation, with p99 latencies of 4,649 ms and a median latency of 2,807 ms. There are a few main contributors to this 2,800 ms median latency. The first one is micro-batching. Despite setting the trigger to 0 ms to achieve processing as fast as possible, we still see micro-batches occurring of around 800–1,000 ms, with occasional increases due to garbage collection. With a constant stream coming in, this means that each event already had an average delay of 400–500 ms before processing has even started. The processing time of the batch in which the event resides then adds another 1,000 ms to this time. Once a record has been joined, its propagation to the tumbling window phase is delayed until event time has passed the window end and watermark interval, which means records are buffered for another micro-batch interval. These three factors together add up to a base latency of 2,800 ms and an even higher p99 latency. As confirmed by the developers of Structured Streaming in [6], the micro-batch mode of Structured Streaming has been developed to sustain high throughput and should not be used for latency-sensitive use cases. By using customized stateful mapping functionality with processing time semantics, we can improve performance since we do not rely on watermark propagation anymore. As can be seen in Fig. 2g, the latency of the tumbling window stage now has a median of 1,277 ms. Adding the sliding window stage leads to a median latency of 1,704 ms. We still see large tail latencies for both stages, going up to 3,700 ms for the sliding window stage. This can be an issue for use cases with tight latency requirements.
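One way to express such a customized stateful map in Structured Streaming is through its arbitrary stateful operators with processing-time timeouts. The heavily simplified sketch below assumes hypothetical record and aggregate types and a toy running average; it only illustrates the mechanism, not the benchmark's implementation.

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record and aggregate types standing in for the benchmark's own classes.
case class Joined(id: String, timestamp: Long, flow: Int, speed: Double)
case class Aggregate(id: String, count: Long, avgSpeed: Double)

// Custom state management with processing-time semantics instead of the built-in
// windowed aggregation: state is updated per micro-batch and results are emitted
// without waiting for watermark propagation.
def updateState(id: String,
                rows: Iterator[Joined],
                state: GroupState[Aggregate]): Iterator[Aggregate] = {
  val previous = state.getOption.getOrElse(Aggregate(id, 0L, 0.0))
  val batch = rows.toSeq
  val newCount = previous.count + batch.size
  val newAvg =
    if (newCount == 0) 0.0
    else (previous.avgSpeed * previous.count + batch.map(_.speed).sum) / newCount
  val updated = Aggregate(id, newCount, newAvg)
  state.update(updated)
  // Expire idle keys after one second of processing time (illustrative value).
  state.setTimeoutDuration("1 second")
  Iterator.single(updated)
}

def aggregate(joined: Dataset[Joined])(implicit spark: SparkSession): Dataset[Aggregate] = {
  import spark.implicits._
  joined
    .groupByKey(_.id)
    .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout())(updateState)
}
```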
For the older Spark Streaming API, we measure low median latencies of 667 ms for the tumbling window and 765 ms after adding the sliding window. However, we see the 99th percentile latencies increase with complexity. We also notice that the uniform distribution caused by the tumbling window join persists throughout the following stages. The tumbling window as well as the sliding window have slide intervals of one second. Therefore, no latency reductions can be obtained by using customized implementations since the latency is directly tied to the fixed micro-batch interval. Spark Streaming becomes a very good contender for these more complex transformations compared to the built-in high-level APIs of the natively faster frameworks Flink and Kafka Streams.

For Flink and Kafka Streams we compare two implementations. When using the built-in API, the latency of Kafka Streams increases the least with the addition of the tumbling window, although this behavior is not sustained when the sliding window is added. For all frameworks, we notice growing tail latencies when complexity is added via joins and aggregations. Partly this is due to the checkpointing mechanisms that are required to do stateful stream processing. As state increases, garbage collection becomes more time consuming and less effective because a larger state needs to be kept and checked at each cycle. Finally, the shuffling and interdependence between the observations to generate the output increase the variance of the latency.

Through customized stateful implementations, we can heavily reduce the latency for some use cases. For Flink, using a custom trigger for the tumbling window reduced the median latency to 1 ms with a p99 of 5 ms. This is much lower than when using the default trigger because the default trigger forces events to be buffered till the end of the window and adds an additional delay due to the burst of data that needs to be processed at the end of the interval. By allowing events to be sent out as fast as possible, the latency can be significantly reduced. Similarly, adding a processor function with managed keyed state and timers to do the sliding window keeps the median latency at 1 ms and slightly increases the p99 to 215 ms. This shows that the flexibility of the Flink API enables optimizing the way stateful operations are done by, e.g., reducing the time and amount of data that is buffered. For Kafka Streams we also notice large latency reductions (4x) when using the low-level API, however, not as large as those of Flink. Besides, it is important to keep in mind that not all processing flows can be optimized by a customized implementation. In this flow, this was the case because the built-in implementation unnecessarily blocks events until a window timeout is reached.
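For reference, the custom-trigger idea described above essentially fires the window on every element instead of waiting for the window to close. A minimal version of such a trigger could look like the sketch below; the eviction and cleanup details of a production implementation are omitted, and this is not the benchmark's exact code.

```scala
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Minimal custom trigger: emit an updated window result for every arriving element,
// so events are sent out as fast as possible instead of being buffered until the
// end of the tumbling window.
class EagerTrigger[T] extends Trigger[T, TimeWindow] {

  override def onElement(element: T, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    // Still register the end-of-window timer so state is eventually purged.
    ctx.registerEventTimeTimer(window.maxTimestamp())
    TriggerResult.FIRE
  }

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    if (time == window.maxTimestamp()) TriggerResult.FIRE_AND_PURGE else TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit =
    ctx.deleteEventTimeTimer(window.maxTimestamp())
}

// Usage sketch:
// stream.keyBy(_.id)
//   .window(TumblingEventTimeWindows.of(Time.seconds(1)))
//   .trigger(new EagerTrigger)
//   .reduce(...)
```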
We can conclude that event-driven frameworks, such as Flink and Kafka Streams, are preferred when doing simple ETL operations. Built-in windowing operations reduce the difference between micro-batching and event-driven systems, shifting the balance in favor of Spark Streaming. The type of stateful transformation chosen has a large influence on latency. Event-driven frameworks make it possible to do interval joins and stateful operations with custom triggers and state management, which can lead to significant latency reductions compared to built-in windowing functionality. Structured Streaming also benefits from a customized implementation
Fig. 5. Periodic burst workload until the tumbling window stage: performance metrics for the last minute of the run.

sustainable throughput, latency measurements are only accurate to a tenth of a second.

When we look at the input throughput, we can clearly discriminate the periodic bursts of around 38 000 messages every ten seconds. In between bursts, the input throughput stays at 400 messages per second. For the tumbling window stage, each output event requires on average 3.17 input events. For the join stage, two input events are required. This explains the discrepancy in messages per second between the input and output throughput. The differences in the height of the output throughput rate between frameworks can be explained by the fact that some frameworks ingest the burst at once while others ingest it in several chunks and output the results more gradually.

In Fig. 5, the relationships between the different performance metrics reveal some interesting patterns. For Spark Streaming, we see that after a burst of the input throughput, CPU levels rise and are followed by a burst in the output throughput. The latency increases for two to three seconds after the burst, which can be noted by the width of the peak. The median latency during the low load periods is around 700 ms. However, in the first seconds after a burst, events have an inflated median latency of 1,370 ms, then it goes up to 2,050 ms, after which it falls back to 1,350 ms again for the final batch. The main reason for this is the increased processing time required to do all the computations for this burst of data. Besides, the events that come in right after the burst incur a scheduling delay due to this increased processing time and, therefore, have inflated latencies as well.

In Fig. 5 we also show the performance of the default trigger implementation for Flink and the DSL implementation for Kafka Streams. We see that the impact of bursts on the latency is the least for Flink, with the median latency rising from around 800 ms to 1,250 ms and the p99 latency staying fairly constant and under 1,500 ms. Due to the lower impact of bursts on the processing time of Flink, the period for which latency remains at higher levels is also shorter. When we execute this workload for earlier stages, such as the parse and join stage (Fig. 6), we see a larger effect on the p50 and p99 latency. This can be explained by the fact that the tumbling window applies buffering, which dampens latency differences across events. When using custom triggers, this buffering is reduced, as can be seen in Fig. 7. We now see clear reactions to bursts again and a very low latency in between bursts. Furthermore, the p99 latency after a burst is 1,200 ms, which is still lower than the almost constant latency of the default trigger implementation. When we use the low-level API of Kafka Streams to implement this stage, the latency of processing a burst is similar to that of Flink and much lower than with the DSL implementation. With the DSL implementation, the median and p99 latency rise considerably after a burst, up to 2,800 ms. For the parsing stage, the latency of Kafka Streams is even slightly lower than that of Flink. In contrast, we can see in Fig. 5 that the latency of Structured Streaming exhibits a very irregular course, with the median fluctuating between 3,000 ms and 4,500 ms and no clear reactions to bursts. This is due to the delays in watermark propagation that have been explained in Section 5.1. When we use the customized implementation that does not use watermarks (Fig. 7), the median latency remains under two seconds. Structured Streaming can sustain the largest throughput due to its dynamic micro-batch approach, and it also benefits from that when processing bursts. For the join stage and the parsing stages, we see lower and less variable latencies and no clear reactions to bursts. Finally, we see that particularly for Kafka Streams and Spark Streaming, immediately after a burst the distribution of the latency becomes much more narrow, with median and p99 latencies almost converging.

Fig. 6. Periodic burst workload until the join stage: performance metrics for the last minute of the run.

For CPU utilization, we see that for most frameworks all workers are equally utilized when processing bursts, with almost all lines in Fig. 5 coinciding. During the processing
of a burst we see CPU usage of 60-90 percent for all workers, while the CPU utilization throughout the rest of the run remains around 2-4 percent. Structured Streaming shows different behavior, with an average load of 6-10 percent and peaks between 20-40 percent, mainly contributed by one worker. This worker had almost constantly higher CPU usage than the other workers in the cluster.

When looking at the heap usage throughout this run, we see that Structured Streaming uses the least memory. For Flink, we see memory usage ramp up to 10 GB before a GC is triggered. These GC cycles are synchronized with the frequency of the bursts. After a burst has been processed, GC is initiated. For the other frameworks this behavior is not as explicit. The amount of memory that is used is not necessarily important for framework performance. What is important is that GCs stay effective and that memory frees up well after each GC, which is the case for all frameworks.

As a conclusion, we can state that using custom implementations can not only reduce the latency of the pipeline in general but also the latency of processing bursts. When we use only high-level APIs, Structured Streaming shows much resiliency against bursts for the parsing and joining pipelines and also uses the least resources. Structured Streaming benefits greatly from its micro-batch execution optimizations when processing bursts, confirming [6]. Due to its dynamic micro-batch interval, it reaches much lower latencies than Spark Streaming for these earlier stages. When the pipeline gets more complex, Flink suffers the least from processing bursty data, followed by Spark Streaming. Detailed visualizations of the other workloads have been put in the Supplemental File, available online.

6 CONCLUSION
In this paper we have presented an open-source benchmark implementation and full analysis of two popular stream processing frameworks, Flink and Spark Streaming, and two emerging frameworks, Structured Streaming and Kafka Streams. Four workloads were designed to cover different processing scenarios and ensure correct measurements of all metrics. For each workload, we offer insights on different processing pipelines of increasing complexity and on the parameters that should be tuned for the specific scenario. Our latency workload provides correct measurements by capturing time metrics on a single machine, i.e., the Kafka broker. Furthermore, dedicated workloads were defined to measure peak sustainable throughput and the capacity of the frameworks to overcome bursts or catch up on delay. We discuss the metrics latency, throughput and resource consumption and the trade-offs they offer for each workload. By including built-in as well as customized implementations where possible, we aim to show the advantages of having a flexible API at your disposal.

A concise overview of the results is shown in Table 2. For simple stateless operations such as ingest and parse, we conclude that Flink processes with the lowest latency. For these tasks, the newer Structured Streaming API of Spark shows potential, with lower median latencies than Spark Streaming. However, it suffers from longer tail latencies, making it less predictable to reach SLAs with tight upper bounds. For the joining stage, event-driven frameworks such as Flink and Kafka Streams have an advantage due to the interval join capability. This type of join can output events with a much lower latency compared to the tumbling window join available in micro-batch frameworks. The latency differences between event-driven and micro-batch systems disappear when the pipeline is extended with built-in tumbling and sliding windows. For these operations, Spark Streaming shows lower tail latencies and outperforms the other frameworks. However, by using the flexibility frameworks offer for these stateful operations through custom triggers and low-level processing capabilities, pipelines can be optimized to reach lower latency.

A second important metric is throughput. Structured Streaming is able to sustain the highest throughput, but this comes at significant latency costs. If latency and throughput are both equally important, Flink becomes the most interesting option. When processing a single large burst at startup, Flink outputs the first events the soonest, while Structured Streaming finishes processing the entire burst the earliest. For use cases where the input stream exhibits occasional bursts, the least performance impact is perceived when using Flink as the processing framework. When using the default implementation, the p99 latency of Flink stays constant throughout bursts, while for other frameworks significant increases are noted. The latency of processing bursts can again be optimized by using low-level APIs.

Finally, the results of this paper can be used as a guideline when choosing the right tool for a processing job by not only focusing on an inter-framework comparison but also highlighting the effects of different pipeline complexities, implementations and data characteristics. When choosing a framework for a use case, the decision should be made based on the most important requirements. If subsecond latency is critical, even in the case of bursts, it is advised to use highly optimized implementations with an event-driven framework such as Flink. When subsecond latency is not required, Structured Streaming offers a high-level, intuitive API that gives very high throughput with minimal tuning. It is important to keep in mind that at the time of this writing parts of the Structured Streaming API were still experimental and not all pipelines using built-in operations were supported yet. For those use cases, Spark Streaming with a high micro-batch interval offers a good alternative. For ETL jobs on data residing on a Kafka cluster that do not require very high throughput and have reasonable latency requirements, Kafka Streams offers an interesting alternative to Flink, with the additional advantage that it does not require a pre-installed cluster.

7 LIMITATIONS AND FURTHER RESEARCH
This benchmark could be extended with joins with static datasets, and the stateful operators should be analyzed under different window lengths. As well, other frameworks could be included, such as Apex, Beam and Storm. Finally, the fault tolerance and scalability of the frameworks under these workloads are interesting directions for future work.

ACKNOWLEDGMENTS
This research was done in close collaboration with Klarrio, a cloud native integrator and software house specialized in
bidirectional ingest and streaming frameworks aimed at IoT & Big Data/Analytics project implementations. For more information please visit https://klarrio.com.

REFERENCES
[1] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl, "Benchmarking distributed stream processing engines," in Proc. IEEE 34th Int. Conf. Data Eng., 2018.
[2] Z. Karakaya, A. Yazici, and M. Alayyoub, "A comparison of stream processing frameworks," in Proc. Int. Conf. Comput. Appl., 2017, pp. 1–12.
[3] S. Chintapalli et al., "Benchmarking streaming computation engines: Storm, Flink and Spark Streaming," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2016, pp. 1789–1792.
[4] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: Towards benchmarking modern distributed stream computing frameworks," in Proc. IEEE/ACM 7th Int. Conf. Utility Cloud Comput., 2014, pp. 69–78.
[5] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proc. IEEE Int. Conf. Ind. Technol., 2016, pp. 592–598.
[6] M. Armbrust et al., "Structured Streaming: A declarative API for real-time applications in Apache Spark," in Proc. Int. Conf. Manage. Data, 2018, pp. 601–613.
[7] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, "Discretized streams: Fault-tolerant streaming computation at scale," in Proc. 24th ACM Symp. Operating Syst. Princ., 2013, pp. 423–438.
[8] A. Shukla and Y. Simmhan, "Benchmarking distributed stream processing platforms for IoT applications," in Proc. Technol. Conf. Perform. Eval. Benchmarking, 2016, pp. 90–106.
[9] A. Shukla, S. Chaturvedi, and Y. Simmhan, "RIoTBench: An IoT benchmark for distributed stream processing systems," Concurrency Comput. Pract. Experience, vol. 29, no. 21, 2017, Art. no. e4257.
[10] G. van Dongen, B. Steurtewagen, and D. Van den Poel, "Latency measurement of fine-grained operations in benchmarking distributed stream processing frameworks," in Proc. IEEE Int. Congress Big Data, 2018, pp. 247–250.
[11] J. Kreps et al., "Kafka: A distributed messaging system for log processing," in Proc. NetDB, 2011, pp. 1–7.
[12] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, "Apache Flink: Stream and batch processing in a single engine," Bulletin IEEE Comput. Soc. Tech. Committee Data Eng., vol. 36, no. 4, pp. 28–38, 2015.
[13] M. J. Sax, G. Wang, M. Weidlich, and J.-C. Freytag, "Streams and tables: Two sides of the same coin," in Proc. Int. Workshop Real-Time Bus. Intell. Analytics, 2018, Art. no. 1.
[14] Flink Documentation, 2019. Accessed: Dec. 19, 2019. [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/
[15] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas, "State management in Apache Flink®: Consistent stateful distributed stream processing," Proc. VLDB Endowment, vol. 10, no. 12, pp. 1718–1729, 2017.
[16] Kafka Streams Documentation, 2019. Accessed: Dec. 19, 2019. [Online]. Available: https://kafka.apache.org/documentation/streams/
[17] Apache Spark: Spark programming guide, 2019. Accessed: Dec. 19, 2019. [Online]. Available: https://spark.apache.org/docs/latest/streaming-programming-guide.html
[18] Structured Streaming, 2019. Accessed: Dec. 19, 2019. [Online]. Available: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
[19] cAdvisor, 2019. Accessed: Jun. 04, 2019. [Online]. Available: https://github.com/google/cadvisor
[20] M. Caporaloni and R. Ambrosini, "How closely can a personal computer clock track the UTC timescale via the Internet?" Eur. J. Phys., vol. 23, no. 4, pp. L17–L21, 2002.
[21] Kafka Improvement Proposals: KIP-32 - Add timestamps to Kafka message, 2019. Accessed: Jun. 04, 2019. [Online]. Available: https://cwiki.apache.org/confluence/display/KAFKA/KIP-32+-+Add+timestamps+to+Kafka+message
[22] Y. Byzek, "Optimizing your Apache Kafka deployment," 2017. Accessed: Dec. 18, 2019. [Online]. Available: https://www.confluent.io/wp-content/uploads/Optimizing-Your-Apache-Kafka-Deployment-1.pdf

Giselle van Dongen (Member, IEEE) is working toward the PhD degree at Ghent University, Ghent, Belgium, teaching and benchmarking real-time distributed processing systems such as Spark Streaming, Flink, Kafka Streams, and Storm. Concurrently, she is lead data scientist at Klarrio, specialising in real-time data analysis, processing and visualisation.

Dirk Van den Poel (Senior Member, IEEE) received the PhD degree. He is currently a senior full professor of Data Analytics/Big Data at Ghent University, Belgium. He teaches courses such as Big Data, Databases, Social Media and Web Analytics, Analytical Customer Relationship Management, Advanced Predictive Analytics, and Predictive and Prescriptive Analytics. He co-founded the advanced master of science in marketing analysis, the first (predictive) analytics master program in the world, as well as the master of science in statistical data analysis and the master of science in business engineering/data analytics.