Choose The Right Stream Processing Engine Whitepaper

This whitepaper provides guidance for technology architects and developers on selecting the appropriate stream processing engine for their data needs, focusing on Kafka Streams, Spark Structured Streaming, and Flink. It discusses key technical and operational factors, including functional aspects, implementation considerations, and challenges associated with streaming data. The paper emphasizes the importance of understanding the specific use cases and requirements of an organization to make informed decisions about stream processing solutions.

WHITEPAPER

Choose The Right Stream Processing Engine For Your Data Needs

Technical and operational factors that are crucial to the decision making process
Table of Contents

Process Data Streams at the Speed of Business and at the Scale of IT  3

Address Challenges Through Informed Decision Making  4


Streaming Challenges  4
Decision Making Process: Technology and Operational Considerations  5

Technology Considerations for Stream Processing Engine Evaluation  5


Functional Aspects 5
Developmental Control  6
Implementation and Beyond  6

Operational Considerations for Stream Processing Engine Evaluation  8


Enterprise Adoption  8
Enterprise Operations 8

Spark, Kafka, or Flink? Which to Use?  9

Flink Use Cases  10


Cybersecurity and Log Processing  10
Outage Classification for Telecom Companies  11
Financial Services: Mainframe Offloading  11
IoT for Manufacturing  12
Fraud Monitoring for Banking  12

Technical Features Table  13

Operational Features Table  14

Customer Success  15

Ensure Fit for Purpose and Enterprise Wide Adoption 15

About Cloudera  16

2 Choose the Right Stream Processing Engine for Your Data Needs
Process Data Streams at the Speed of Business and at the Scale of IT

Business opportunities that directly impact revenue or boost operational efficiency need to be addressed in near real-time. Digital transformation initiatives and the advancements in mobility, IoT, and streaming technologies have led to enterprises being inundated with data. Key business requirements determine how such high volumes of high-speed data should be processed in real-time to provide actionable intelligence. This directly leads to IT having to evaluate which stream processing engine is best fit for purpose for their enterprise needs. Other determining factors include return on investment, dexterity to be applied across multiple use cases, and level of maturity for enterprise-wide adoption.

This paper is meant to help technology architects and developers choose the right stream processing engine for their needs. We do this by analyzing key technical and operational differentiators between three modern stream processing engines from the Apache open source community:

• Kafka Streams;
• Spark Structured Streaming;
• Flink.

This paper also highlights some of the capabilities that are key to any data streaming use case, such as:

• Watermarks to handle late and out of order delivery;
• Windowing semantics to structure the streams;
• Complex event processing; and
• Capabilities that enhance operational efficiency.

Cloudera offers all of the engines listed here, because we believe that you should use the best tool for the job. Sometimes that tool is a very simple one, but more often than not, you will need the advanced capabilities for your specific use cases.

There are a variety of ways by which to address data stream processing challenges. The solution comes down to the fundamental way in which the engine works and how your organization implements it.

[Figure: Cloudera Manager provides one view to manage all of your resources, including stream processing engines. Here we see Flink (1), Kafka (2), and Spark (3) resources in one comprehensive view. Source: Cloudera.]
Address Challenges Through Informed Decision Making

Global digitization has resulted in a vast array of new products and services with such high levels of convenience that it fuels a continuous loop of greater expectations for immediacy. Next-day delivery and real-time payments are demands driven by consumers at the point of service that then pressure downstream services to respond faster. Processing and analyzing billions of events per second across geographies is becoming an ordinary affair.

In response, technology teams have been pivoting from large monolithic database architectures to event-driven applications and microservices designs as a way to reduce the inevitable latency of inputs and outputs across networks by bringing the state of an event closer to the application itself.

Central to this effort are modern data stream processing engines like Kafka Streams, Spark Structured Streaming, and Flink. Of these three, Flink is the oldest, while Kafka Streams is the newest. The Spark Structured Streaming community is large while Flink's is growing rapidly. Knowledge of an engine's development community can help gauge how self-sufficient and productive your team can be.

The engine that is best for you depends on your organization's use cases, team makeup, and various technology and operational factors. This paper is meant to help you in that evaluation process.

Streaming Challenges

The following are reminders of streaming challenges you've undoubtedly had or will come across.

Event time and processing time — The chance that streaming events come in without any delays and with predictable patterns is low, because you can't control the myriad of input sources that exist across collections of networks that vary in type and quality. Even with the very best networks and the fastest collection mechanisms, there will always be latency between the time an event happens in the real world (event time) and the time your system processes it (processing time).

Bounded and unbounded streams — Bounded streams have a beginning and an end, so it is easy to reason about time and correctly sort events, akin to batch. Unbounded streams are harder to reason about because, without an end, you don't know if another live event is yet to come. Calculations, aggregations, or pattern detection in unbounded streams are very tricky. To handle both scenarios, it is helpful to follow a "streaming first" principle (see sidebar next page), and to consider capabilities like watermarks to handle late and out of order events (see sidebar next page).

Simple and complex events — Complex events are derived from simple events that have been aggregated, patterned, and evaluated to trigger a response or present a result, often on data that continuously moves under your feet. Decisioning on unbounded streams requires the state of events to be stored and analyzed.

Stateless and stateful — Stream processing engines excel when analytics require a reassessment of events within the context of time. That is considered stateful, while stateless represents a self-contained fire-and-forget paradigm. There are acceptable trade-offs between stateless high-throughput engines and stateful engines that need to address aggregation, enrichment, and other requirements.

[Figure: Bounded and Unbounded Streams — a timeline running from the start of the stream through past, now, and future. Bounded streams have a defined beginning and end; unbounded streams continue indefinitely into the future.]
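The point that a bounded stream is just a special case of an unbounded one can be illustrated with a short, hypothetical Python sketch. The generator below is purely illustrative (the engines express this idea in their own APIs): the same incremental operator runs unchanged over a finite list and over a never-ending source.

```python
from itertools import count, islice

def running_sum(stream):
    """Incremental aggregation that never assumes the stream ends."""
    total = 0
    for event in stream:
        total += event
        yield total

# A bounded stream is simply a stream that happens to stop.
bounded = [1, 2, 3, 4]
print(list(running_sum(bounded)))                # [1, 3, 6, 10]

# The same operator works on an unbounded stream; we can only
# ever observe a finite prefix of its output.
unbounded = count(1)                             # 1, 2, 3, ... forever
print(list(islice(running_sum(unbounded), 4)))   # [1, 3, 6, 10]
```

Because the operator never looks ahead, it makes no difference whether the input eventually ends, which is the essence of the "streaming first" principle discussed on the next page.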
Decision Making Process: Technology and Operational Considerations

There is often an over-reliance on streaming benchmarks when choosing a stream processing engine. These benchmarks primarily concentrate on latency, throughput, and hardware utilization, neglecting functional requirements and the level of control developers possess when implementing a solution. Additionally, benchmarks often overlook crucial operational, staffing, and other nonfunctional criteria. The remainder of the paper outlines vital technological and operational factors necessary for making informed decisions and encouraging enterprise-wide adoption of the chosen solution.

Technology Considerations for Stream Processing Engine Evaluation

The next section compares the different stream processing engines with regard to functional, developmental, and implementation considerations.

Functional Aspects

The functional capabilities of stream processing engines as they pertain to approach, streaming model, and time support are used to solve specific business requirements.

Approach — The types of approaches that development communities took at the inception of an engine's development include "streaming first", "message broker first", and "batch first". The distinction helps in understanding what the engine was originally meant for. Flink took a "streaming first" approach and is regarded as the modern leader in this space. Spark Structured Streaming followed a "batch first" approach, while Kafka Streams was initially developed as "message broker first". Streaming capabilities are popular add-ons for both.

Streaming Model — Earlier, we described the concepts of "stateless" and "stateful", why it is critical to distinguish between the two, and the trade-offs between throughput and latency.

Sidebar: The "Streaming First" Approach

Stream processing engines have followed different paths in their approach to solving unique time reasoning challenges.

Flink is a "streaming first" distributed system. This means that it has always focused on solving the difficult unbounded stream use cases over bounded stream and batch scenarios.

It turns out that algorithms that work on unbounded streams also work on bounded streams, by treating the latter as a special case of the former. As a result, Flink addresses micro-batch use cases as well.

Sidebar: Watermarks To Handle Late Delivery

Watermarks offer a robust method for handling late or out-of-order arrivals by providing a set of trigger messages that are injected alongside the data stream.

For unbounded streams, in which you don't have a definitive end, watermarks delineate points at which you would expect all of the events to have occurred. It is from here that you can establish some logic.

A collection of watermarks forms windows, providing structure to your data streams and enabling the application of reasoning (refer to Window Semantics in the sidebar on page 7).

Flink provides extensive control over watermark generation. This provides more options as to how completely you want to capture events that may or may not arrive. This mechanism can also extend to very sophisticated functionality like leveraging upstream and downstream materialized views or using batch engines to reprocess and incorporate late data.
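As a rough illustration of the watermark mechanics described in the sidebar above, here is a hypothetical Python sketch of a bounded-out-of-orderness watermark. Real engines such as Flink expose this through their own watermark-strategy APIs; the function and names below are invented for illustration only.

```python
def watermarks(events, max_out_of_orderness):
    """For each (id, event_time) pair, emit (id, watermark, is_late).

    The watermark trails the maximum event time seen so far by a fixed
    bound; an event is 'late' if it arrives below the current watermark.
    """
    max_event_time = None
    results = []
    for event_id, event_time in events:
        late = (max_event_time is not None
                and event_time < max_event_time - max_out_of_orderness)
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        wm = max_event_time - max_out_of_orderness
        results.append((event_id, wm, late))
    return results

# Event times arrive out of order; allow 5 time units of lateness.
out = watermarks([("a", 10), ("b", 12), ("c", 4), ("d", 13)], 5)
for event_id, wm, late in out:
    print(event_id, wm, late)
# "c" (event time 4) arrives below the watermark 12 - 5 = 7, so it is late.
```

Tightening `max_out_of_orderness` lowers latency but drops more stragglers; loosening it captures more late events at the cost of waiting longer before results can be finalized.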
Flink, Kafka Streams, and Spark Structured Streaming are all stateful, but with slight differences. Having taken a "batch first" approach, Spark Structured Streaming handles events as micro-batches, and it excels when high throughput is necessary but low latency is not a big requirement. The two other stateful engines differ in how they store state. Kafka Streams depends on the Kafka ecosystem, while Flink provides more storage options. Both process messages one event at a time and are considered low-latency solutions.

Time Support — All of the stream processing engines described in this paper are able to distinguish event time from processing time. The nuance is in how much control you have to address some of the trickier use cases. Flink provides a great deal of control with capabilities such as watermarking and session windows (see sidebars on page 5 and page 7).

Developmental Control

A common task in every data processing use case is to import data from one or multiple systems, apply transformations, and then export results to another system. Considering the ubiquity of streaming data applications, unified integration with machine learning, graph databases, and complex event processing is becoming more common.

Processing Abstraction — To help your engineering team stay productively focused on business logic instead of advanced streaming concepts, it's important to evaluate the stream processing engine's abstraction capabilities.

Spark Structured Streaming has a rich set of libraries to implement machine learning use cases. If you are already developing within a Spark ecosystem with frameworks such as MLlib, choosing Spark Structured Streaming as your stream processing engine will make for a much easier adoption.

Special attention should be paid to the engine's SQL abstraction. From an analytics democratization point of view, SQL abstraction is a very important basis for comparison. While many senior developers prefer sophisticated languages like Scala for intricate analytical tasks, the expressiveness and simplicity of SQL can get the job done more easily, and it is accessible to a wider range of developer, and even business, resources. This accessibility is a critical factor in enabling domain ownership of data apps.

When it comes to comparison on the basis of SQL, the more standard the better. Flink has the most mature and production-tested open source SQL-on-streams implementation and is fully ANSI compliant. Kafka's ksqlDB, formerly known as KSQL, has matured significantly and is now fully open source. However, it is not ANSI-compliant, nor as feature-rich as Flink's offering. Spark Structured Streaming SQL is well adopted and supports the ANSI SQL:2003 standard.

Implementation and Beyond

Application development is only as good as its implementation. Below, we cite aspects that need to be considered to move beyond the idea and development stage.

Delivery Guarantee — This is a key factor to consider as it relates to your expectations of latency, throughput, correctness, and fault tolerance of message delivery.

At-least-once delivery ensures that a message is delivered at least once, although retried delivery attempts may result in duplicates. This approach offers high performance and minimal overhead, since the sender only retries until a delivery is acknowledged and no deduplication state is maintained. At-least-once delivery satisfies low-latency and high-throughput requirements while guaranteeing message delivery, but may not sufficiently address data duplication concerns.

Some applications, such as financial transactions, have stricter demands and may require that messages are received and processed exactly once. This requires retries to counter transport losses, which means keeping state at the sending end and having an acknowledgement mechanism at the receiving end. Exactly-once is optimal in terms of correctness and fault tolerance, but comes at the expense of added latency.

All the engines described in this paper provide an exactly-once delivery guarantee, though Kafka Streams is limited to the Kafka ecosystem and can't control downstream systems. Flink and Spark Structured Streaming guarantee exactly-once delivery from any replayable upstream source, and in some cases with downstream platforms as well, if they have transactional support.

State Management — The aforementioned trade-off between an exactly-once delivery guarantee and the inevitable latency of state storage may drive the selection process based on the state management capabilities that come with the engine.

Kafka Streams is good for things that are Kafka-centric, because it tends to rely heavily on Kafka storage for state. Like Flink, it uses a local RocksDB, but it checkpoints the state as a Kafka topic, and that limits flexibility as to how you store and access the history of that state. Within a Kafka ecosystem, a good linear access mechanism is provided, making everything nice and tame. This works great for simple use cases, but it doesn't provide the flexibility and operational capabilities that some of the other engines do.
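One common way to obtain exactly-once *processing* on top of an at-least-once transport is to retry deliveries and deduplicate on the receiving side before applying state changes. The Python sketch below is purely illustrative of that idea; the class names are invented and this is not how any particular engine implements its guarantee internally.

```python
class AtLeastOnceChannel:
    """Simulates a transport that may redeliver messages (at-least-once)."""
    def __init__(self, messages, duplicates):
        self.deliveries = []
        for i, msg in enumerate(messages):
            self.deliveries.append((i, msg))
            if i in duplicates:            # retry after a lost acknowledgement
                self.deliveries.append((i, msg))

class DedupConsumer:
    """Turns at-least-once delivery into exactly-once processing by
    remembering which message ids have already been applied to state."""
    def __init__(self):
        self.seen = set()
        self.total = 0
    def consume(self, msg_id, amount):
        if msg_id in self.seen:
            return                         # duplicate: skip, state unchanged
        self.seen.add(msg_id)
        self.total += amount

channel = AtLeastOnceChannel([10, 20, 30], duplicates={1})
consumer = DedupConsumer()
for msg_id, amount in channel.deliveries:
    consumer.consume(msg_id, amount)

print(len(channel.deliveries))   # 4 deliveries (one was retried)...
print(consumer.total)            # ...but the total counts each message once: 60
```

The trade-off described above is visible here: the consumer must keep extra state (the `seen` set, or a transactional equivalent) in exchange for correctness.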
Apache Flink provides more complete state management capabilities compared to Kafka Streams, offering features such as native support for different state backends, efficient checkpointing, and built-in fault tolerance mechanisms. While both Flink and Kafka Streams can handle stateful stream processing, Flink's state management system is designed to scale seamlessly across distributed environments, allowing for automatic state redistribution and recovery during rescaling or failures. This makes Flink particularly well-suited for large-scale, complex, and state-heavy streaming applications that require high levels of fault tolerance and flexibility in state management.

Leveraging Apache Flink's advanced checkpointing capabilities offers a multitude of operational advantages, such as seamless job versioning, effortless cluster migration, streamlined application and infrastructure upgrades, and the flexibility to transition workloads between cloud and on-premises environments. This approach is increasingly being embraced by the wider stream processing engine community, further solidifying Flink's position as a trailblazer in stateful stream processing.

Fault Tolerance/Resilience — The demand to mitigate operational disruption is so strong that the concept of "resiliency" attracts regulatory oversight across industries. Streaming architecture capabilities such as checkpointing, savepoints, redistribution, and state management (see above) are crucial to the stream processing engine selection process.

Spark Structured Streaming has built-in capabilities, while Kafka Streams requires you to "build your own", using ZooKeeper to replace a failed broker, for example.

Flink's fault tolerance mechanism uses checkpoints to draw consistent snapshots to which the system can fall back in case of a failure. The aforementioned state management capabilities ensure that even in the presence of failures, the program's state will eventually reflect every record from the data stream exactly once.

Sidebar: Window Semantics

Flink allows you to customize a window structure so you are not limited to pure linear time. By utilizing session windows, for instance, you can define windows based on gaps between events or the number of events. This offers considerable flexibility in assigning events to different windows before processing.

Windows are logically essential for analytics, as they provide the structure upon which analysis is based.

Two other windowing examples are tumbling windows (which slice a stream into even chunks) and sliding windows (which enable your aggregations and analytics to move with time).

[Figure: The Difference Between Tumbling and Sliding Windows — over a shared timeline, tumbling windows partition the stream into consecutive, non-overlapping chunks, while sliding windows overlap as they advance. Source: Cloudera.]
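The tumbling and sliding window assignment pictured in the figure can be sketched with a small, hypothetical Python example. Real engines assign events to window panes internally; the two functions here are invented purely to make the overlap visible.

```python
def tumbling_windows(events, size):
    """Assign each (timestamp, value) event to exactly one fixed-size window."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size
        windows.setdefault((start, start + size), []).append(value)
    return windows

def sliding_windows(events, size, slide):
    """Assign each event to every window whose span covers its timestamp.
    With slide < size, windows overlap, so events land in several of them."""
    windows = {}
    for ts, value in events:
        start = (ts // slide) * slide         # latest window covering ts
        while start > ts - size:              # walk back over overlapping windows
            if start >= 0:
                windows.setdefault((start, start + size), []).append(value)
            start -= slide
    return windows

events = [(1, "a"), (3, "b"), (5, "c")]
print(tumbling_windows(events, size=4))
# {(0, 4): ['a', 'b'], (4, 8): ['c']}
print(sliding_windows(events, size=4, slide=2))
# {(0, 4): ['a', 'b'], (2, 6): ['b', 'c'], (4, 8): ['c']}
```

Note how event "b" (timestamp 3) falls into one tumbling window but two sliding windows, which is exactly what lets sliding aggregations "move with time".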
Sidebar: Complex Event Processing (CEP)

Processing real-time events and extracting information to identify more meaningful events, like understanding behaviors, probably ranks as one of the most interesting streaming use cases.

Flink's statefulness and window handling capabilities are the foundation on which advanced CEP is crafted. What makes Flink all the more compelling is that CEP is accessible to a wider range of developer resources through standard SQL abstraction.

For example, the MATCH_RECOGNIZE SQL statement can be very helpful when you are looking for patterns built up through sequences of events that can't be distinguished by simple counting methods.

The standard SQL abstraction of Flink makes it a compelling choice for use cases that require "simple" complex event processing.

Operational Considerations for Stream Processing Engine Evaluation

All organizations look to control costs by doing more with less. Budget approval for your selected stream processing engine may be contingent on its utility across streaming pipelines, reusability of talent, and synergy with existing tech stacks.

Enterprise Adoption

The best application is no good if it can't be efficiently and safely deployed across your organization, or if there is a dependency on hard-to-find development talent. Effective solutions are those that can be adopted across the entire enterprise.

Deployment Model — There is a better chance of adoption if teams don't have limited deployment options. Flink can be deployed in clustered, Kubernetes, YARN, Kafka, HDFS, Docker, S3, and microservices environments, while Structured Streaming and Kafka Streams are more limited (see Technical Features Table, page 13). Kafka Streams is the most lightweight for microservices, at the cost of out-of-the-box features, and the Flink library is nearly as lightweight.

Kafka Streams doesn't provide a scheduler or full deployment framework out of the box. While it provides efficient ways of writing simple applications, you are left to your own devices on how to launch, run, orchestrate, and operate those applications.

Community Maturity and Documentation — To ensure that your developer resources are self-sufficient and productive, the maturity of the developer community and the quality of documentation are very important aspects to consider.

Kafka Streams is relatively young, with very strong community growth and extensive documentation and examples. While the Spark Structured Streaming community is large and busy with extensive documentation and examples, they still need help from more reviewers and committers, something that Cloudera is helping to drive forward.

Flink is the fastest growing community, with strong research and production deployments. Documentation and working examples are good and will broaden considerably as the community matures. "For the 5 year period from 2017-2022 Flink has been the most active community in terms of code commits and community discussion". Also, some of the biggest brand name companies have already invested in large deployments of Flink for their real-time stream processing needs.

Enterprise Operations

If you are looking to establish a legacy of successful, fit for purpose data streaming solutions, it is important to know that you've selected an engine that completely integrates into your organization's security framework, provides comprehensive monitoring and metrics, and can scale up and down in line with business demand.

Enterprise Management — At Cloudera, we've invested a good deal of our time to integrate the stream processing engines described in this paper into Cloudera, to make sure that they are all enterprise ready for their respective purposes. Both Flink and Spark Structured Streaming have rich Operations Support Systems (OSS) with enhanced vendor offerings. Kafka Streams has minimal OSS via some vendor offerings. Over time, Cloudera has enhanced Spark Structured Streaming with important metadata, logging, metrics, and operational capabilities that are also starting to find their way to Flink but, for now, the former has the edge on Cloudera.

As it relates to Kafka, the original security implementation was done by Cloudera (via integration with Apache Ranger for role-based authorization and auditing), and Cloudera still provides leading security capabilities via scale-tested role-based and attribute-based access control models, and integrated data governance with Apache Atlas in Cloudera. The other advantage of Cloudera for these stream processing engines is the fine-grained, integrated, single-pane-of-glass security control.

Scaling Up / Scaling Down — Another consideration is that streaming workflows tend to be multi-modal and unbalanced throughout the day, so scaling capabilities are absolutely crucial. Flink and Spark Structured Streaming are developing auto-scaling approaches to automatically maintain steady and predictable performance. They each have a solid orchestration platform underneath, which tends to give them an edge over the "build your own" type of approach that you get with Kafka Streams.

Spark, Kafka, or Flink? Which to Use?

Boiling this down to high-level guidance and decision making points, below is the heuristic that Cloudera tends to work with when advising customers.

Spark Structured Streaming is best for developer accessibility and whole-platform solutions where low latency and advanced streaming are not required, e.g. combining batch and stream where response time is measured in seconds to minutes. Choose it when:

• Spark Structured Streaming is already standard at your organization.
• You need a unified batch/stream solution.
• The highest levels of throughput are crucial.
• Low latency is not necessary.
• Advanced time/state features would be overkill.

Kafka Streams is best for Kafka-only architectures without advanced streaming features. Choose it when:

• You only need microservices.
• Throughput is essential.
• Low latency is crucial.
• Time/state features are not needed out-of-the-box.
• Application operational and resilience requirements are simple or handled elsewhere.

Flink is best for covering the full range of streaming requirements. Choose it when:

• You need flexibility across microservices, batch pipelines, and streaming.
• High throughput is necessary.
• Low latency is crucial.
• Use cases call for advanced windowing and state capabilities.
• You are not scared of new solutions, especially those that are best-in-class.

Note: Both Flink and Kafka Streams can be used as libraries in microservice architectures.
Flink Use Cases

To get a practical understanding of how Flink is used in the real world, we have described a variety of use cases below.

Cybersecurity and Log Processing

A classic streaming data challenge is to identify and act upon intrusions and fraudulent events that are hidden within terabytes of dynamic machine logs. Throughput and latency are obviously important factors to consider, because you want to identify an issue as quickly as possible. But effective action against criminals that doesn't alienate good customers requires understanding the behaviors of both good and bad actors.

The robust state management features of Flink serve as a cornerstone for cutting-edge cybersecurity solutions. By harnessing the performance advantages of localized state, Flink empowers efficient enrichment functions and intricate event processing. Advanced session windows and watermarking techniques enable in-depth analysis of dynamic data streams, facilitating thorough investigations for a secure digital landscape.

Sidebar: Operational Efficiency

How would you understand what contributed to an unexpected value in a complex calculation while data continues to stream in?

Use the State Processor API in Flink to recreate states as transactional snapshots and then dig in as much as needed to explain the origin and reasoning behind that dynamic calculation. Checkpoints and savepoints are key to understanding how this works.

A checkpoint provides a recovery mechanism in case of unexpected job failures, while a savepoint is a consistent image of the streaming job's execution state taken on demand. You can use savepoints to stop-and-resume, fork, or update your jobs.

A fundamental aspect of Flink since its inception has been the separation of the state into local pieces that are linked together through consistent checkpointing levels. This minimizes latency but also provides all sorts of operational efficiency gains. For example, you can:

• Deploy new application versions that are preloaded from the current production state.
• Do accurate A/B testing of new algorithms, because you can easily bring up new instances against a solid starting point.

To deal with supersets of data, the states saved locally in a high-performance key-value store (RocksDB) are checkpointed down to HDFS, a cloud blob store, or another durable repository.
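The snapshot-and-restore workflow in the sidebar — copy local state to a durable store, then resume or fork from it — can be sketched in a few lines of illustrative Python. The classes and names below are hypothetical stand-ins, not Flink's actual API.

```python
import copy

class StreamingJob:
    """Toy stateful job: counts events per key in 'local' state."""
    def __init__(self, state=None):
        self.state = state if state is not None else {}   # stands in for RocksDB
    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
    def snapshot(self, durable_store, name):
        # Checkpoint/savepoint: a consistent copy of state in durable storage.
        durable_store[name] = copy.deepcopy(self.state)

durable = {}                        # stands in for HDFS or a cloud blob store
job = StreamingJob()
for key in ["a", "b", "a"]:
    job.process(key)
job.snapshot(durable, "savepoint-1")

# Fork a second instance from the savepoint (e.g. to A/B test a new
# algorithm against a solid starting point), then let it diverge.
fork = StreamingJob(copy.deepcopy(durable["savepoint-1"]))
fork.process("c")

print(job.state)    # {'a': 2, 'b': 1}
print(fork.state)   # {'a': 2, 'b': 1, 'c': 1}
```

Because the snapshot is a deep copy, the fork's divergence leaves both the original job and the stored savepoint untouched, which is what makes stop-and-resume, forking, and upgrades safe.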
Outage Classification for Telecom Companies

For decades, telecom companies have focused on their network infrastructure and preventative maintenance, but often as a reaction to past events. Customer and regulatory demands require a dynamic approach to predict and mitigate spotty performance and outages.

Adapting network strength and mass availability is crucial and requires aggregated analysis on vast amounts of data over a wide array of networks to find anomalies, predict where failures are likely to occur, or even just record the state of the current network at any point in time.

5G has only increased the volume and variety of metrics available, so having the ability to scale and perform analytics on incoming events quickly is absolutely critical to identifying problems before customers do.

Financial Services: Mainframe Offloading

Today's consumers expect swift and seamless service experiences. Traditional, overburdened mainframes struggling with low-latency user interactions often hinder banks' efficiency, especially in the era of Open Banking directives, which require them to share customer data with third-party providers. To address this challenge, financial institutions are transitioning customer relationship functions from mainframes to agile stateful stream processing engines like Flink. This shift enables tailored product offerings based on real-time spending patterns, delivering an enhanced and personalized customer experience.

A key driver of success is data consistency. Flink's exactly-once delivery guarantee ensures that data is processed exactly once, even in the event of failures. This is critical for applications that need to track customer spending behavior and balances, as well as those that need to make real-time decisions based on the latest data. Flink's ability to combine exactly-once delivery with complex event processing (CEP) makes it a powerful tool for real-time marketing. By analyzing real-time data streams, Flink can identify patterns that indicate when a customer is likely to be interested in a particular product or offer. This allows banks to deliver the right offer at the right time, which can lead to increased customer engagement and sales. Flink's ability to perform data access, enrichment, and decisioning locally eliminates the need to access external data sources. This can significantly improve response times, as data does not need to be transferred over the network. This is essential for applications that require millisecond-level response times, such as fraud detection and risk assessment.
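The benefit of local enrichment and decisioning described above can be illustrated with a hypothetical sketch: the dictionary below stands in for operator-local state hydrated once from the system of record, so each event is enriched and decided without a per-event network hop. All names and amounts are invented for illustration.

```python
# Per-customer profile kept local to the operator; no remote lookup per event.
profiles = {
    "cust-1": {"balance": 250},   # hydrated once from the upstream system
    "cust-2": {"balance": 40},
}

def enrich_and_decide(txn):
    """Enrich a transaction from local state and decide in place."""
    profile = profiles[txn["customer"]]
    return {**txn,
            "balance": profile["balance"],
            "approved": txn["amount"] <= profile["balance"]}

print(enrich_and_decide({"customer": "cust-1", "amount": 100}))  # approved
print(enrich_and_decide({"customer": "cust-2", "amount": 100}))  # declined
```

The decision path touches only in-memory state, which is the property that makes millisecond-level response times feasible in the scenarios above.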
IoT for Manufacturing
IoT devices streamline supply chain operations within a manufacturing facility. Today, manufacturers are leveraging advanced monitoring sensors and real-time technologies to track the quality of goods, automate the visual inspection of goods, and customize manufacturing for individual partners.

Flink's advanced windowing and state capabilities make it an ideal platform for processing sensor data in manufacturing. By aggregating and comparing sensor data over time, Flink can identify patterns that indicate potential problems with machines. This allows manufacturers to take corrective action before a problem causes a costly outage. Additionally, Flink's flexibility makes it a good fit for the different use cases in manufacturing IoT. Flink can process data from both streaming and batch sources, and it can be deployed as a microservice or a standalone application. This makes it easy to integrate Flink with other manufacturing systems. Overall, Flink is a powerful platform for processing sensor data in manufacturing: its advanced windowing and state capabilities, as well as its flexibility, make it a good fit for the diverse needs of manufacturing.

Fraud Monitoring for Banking
Today's financial services demand instant approvals, which require data from various sources, while transaction streams carry ever-increasing volumes of data. Fraudsters try to exploit these conditions and continue to evolve their tactics to stay one step ahead.

Flink's ultra-low-latency unified processing makes it possible to process transaction streams in real time while executing table lookups for historical customer data. To identify fraudulent patterns, transactions must be processed in the context of the customer profile and transaction history. Furthermore, with the Flink SQL APIs providing a useful abstraction layer, fraud monitoring logic can be expressed in the language of data, making Flink's capabilities more accessible to less technical fraud analysts so they can adapt their monitoring techniques to stay ahead of the fraudsters.
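The windowed aggregation idea described above can be illustrated with a minimal Python sketch. This is not Flink code; it only mimics what a tumbling event-time window computes, and the sensor readings, window size, and machine IDs are made up:

```python
from collections import defaultdict

# Minimal sketch of a tumbling-window aggregation over sensor events.
# Timestamps, window size, and readings are hypothetical illustration data.

WINDOW_MS = 10_000  # 10-second tumbling windows

def window_averages(events):
    """Group (timestamp_ms, machine_id, temp) readings into tumbling
    windows and return {(machine_id, window_start): average_temp}."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, machine, temp in events:
        key = (machine, ts - ts % WINDOW_MS)   # start of the window ts falls in
        sums[key][0] += temp
        sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

events = [(1_000, "m1", 70.0), (4_000, "m1", 72.0),    # window starting at 0
          (12_000, "m1", 95.0), (15_000, "m1", 97.0)]  # window starting at 10_000
avgs = window_averages(events)

# Comparing consecutive window averages surfaces a drifting machine that
# may need inspection before it causes an outage.
drift = avgs[("m1", 10_000)] - avgs[("m1", 0)]
```

In a real Flink job the same per-key, per-window sums would be held in managed state and emitted when the window's watermark passes, but the arithmetic of "aggregate, then compare over time" is exactly what this sketch shows.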

Technical Features Table
The table below gives a technical comparison across three modern stream processing engines. Refer to it when evaluating the functional and developmental aspects of your project.

| | Flink 1.16 | Kafka Streams 2.4 | Spark Structured Streaming 2.4 |
|---|---|---|---|
| Approach, position | ● Streaming first; modern class-leader | ◆ Message-broker first; popular streaming add-on | ◆ Batch first; popular streaming add-on |
| Streaming model, throughput, type | ● Stateful (first-class requirement); <500 milliseconds; event-at-a-time | ● Stateful; <500 milliseconds; event-at-a-time | ◆ Stateful; greater than 1 second; microbatch |
| Time support | ● Event time; processing time; customizable for greater control | ● Event time; processing time | ● Event time; processing time |
| Processing abstractions | ◆ Table; SQL (ANSI standard); complex event processing; graph; machine learning; batch (experimental) | ◆ Table; SQL-like DSL (KSQL, not ANSI compliant); no batch | ● Table; ANSI SQL:2003 in Spark Structured Streaming 2.3; graph; machine learning; unified APIs for batch and stream |
| Delivery guarantee | ● Upstream: exactly-once; Downstream: some capabilities, depending on the downstream system | ◆ Upstream: exactly-once (Kafka Streams only); Downstream: no | ● Upstream: exactly-once; Downstream: some capabilities, depending on the downstream system |
| State management | ● RocksDB; configurable snapshots; queryable | ● BYO RocksDB; snapshot to a Kafka Streams topic; queryable | ◆ RocksDB (Databricks only); OSS sync on HDFS |
| Fault tolerance and resilience | ● Built-in; checkpoints; savepoints; redistributable | ◆ BYO microservice | ● Built-in |

● Great fit for purpose  ◆ Fits with some work  ▶ Fits with a lot of work
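One practical consequence of the Delivery guarantee row: when a sink only receives at-least-once delivery, events replayed after a failure show up as duplicates unless writes are idempotent. The following minimal Python sketch illustrates the idempotent-sink idea (the in-memory store and event IDs are hypothetical stand-ins for a real downstream system):

```python
# Sketch of an idempotent sink: keying each write by a stable event ID
# makes redelivered events harmless, giving effectively-once results
# downstream even when the engine replays input on recovery.
# The dict-based store and event records are illustration only.

store = {}

def write(event):
    """Upsert by event ID; a replayed event overwrites with the same value."""
    store[event["id"]] = event["amount"]

for e in [{"id": "e1", "amount": 10},
          {"id": "e2", "amount": 5},
          {"id": "e1", "amount": 10}]:   # duplicate delivery after a replay
    write(e)

total = sum(store.values())   # the duplicate of e1 did not double-count
```

This is why the tables mark downstream guarantees as "some capabilities, depending on the downstream system": the engine can replay deterministically, but end-to-end correctness still depends on the sink being idempotent or transactional.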

Operational Features Table
The table below gives an operational comparison across three modern stream processing engines. Refer to it when evaluating the nonfunctional aspects of your project.

| | Flink 1.16 | Kafka Streams 2.4 | Spark Structured Streaming 2.4 |
|---|---|---|---|
| Deployment model | ● Clustered; Kubernetes; YARN | ◆ Not clustered; Kubernetes; microservices; Kafka; Docker | ◆ Clustered; Kubernetes; S3; microservices |
| Documentation | ◆ Good technical documentation; growing examples; Stack Overflow coverage | ● Extensive documentation; extensive examples; Stack Overflow coverage | ● Extensive documentation; extensive examples; Stack Overflow coverage |
| Maturity/community | ● Smaller but fastest-growing community, with strong research and production deployments | ● Newest; strong community with strong growth | ● Spark Structured Streaming community is strong, but Streaming is a small, quiet corner |
| Use cases | ● Unbounded and bounded streams; batch; complex event processing; IoT; microservices; others | ● Microservice/event-driven, embedded in another application | ● Unified ETL; semi-real-time processing |
| Enterprise management | ● Rich OSS; enhanced vendor offerings | ◆ Minimal OSS; some via vendor offerings | ● Rich OSS; enhanced vendor offerings |
| Push-button security | ◆ Complex; some OSS support; limited vendor offerings | ● Simple; some OSS support; good vendor offerings | ● Complex; good OSS support; good vendor offerings |
| Logging/metrics | ◆ Usual OSS integrations; some vendor offerings | ◆ BYO microservices | ● Good logging integration |
| Scaling up/down | ● Not yet autoscaling, but all requirements available | ◆ BYO microservice; scaling limits (e.g., shuffle sort) | ● Not yet autoscaling, but all requirements available |

● Great fit for purpose  ◆ Fits with some work  ▶ Fits with a lot of work

Customer Success
To provide insight into the business impact that a comprehensive data-in-motion solution can deliver, we offer these customer success examples.

1. An international communications company serving consumers and businesses in ten countries deployed the Cloudera streaming data platform to tackle a variety of critical use cases, including stream processing, log aggregation, large-scale messaging, and customer insights.

Results included an improved overall customer experience through strategic use of data analysis, reduced infrastructure management costs and TCO, and real-time actions that improve business outcomes.

2. A large European bank specializing in agriculture financing and sustainability-oriented banking across global markets leveraged Cloudera's streaming data platform to run sophisticated real-time algorithms and financial models that help customers manage their financial obligations, including loan repayments.

By implementing the platform and gaining the ability to stream real-time data, the bank can now detect warning signals at extremely early stages where clients may go into default. Through their new, governed data lake, the bank's account managers are also able to access an in-depth overview of customer data, enabling them to generate liquidity overviews and advise customers on how to avoid defaulting. Through rapid data processing, better models are created that more accurately predict warning signals.

3. AfterPay is a frictionless payment solution for customers. By providing instant loans, AfterPay is able to give customers a convenient "buy now, pay later" option. The speed at which data is captured, streamed, and processed is what allows AfterPay to offer this service. Behind the scenes, AfterPay processes up to 800 million transactions daily, all of which pass through the fraud monitoring team before approvals are made. The speed of Flink data processing helps AfterPay stop bad actors who would seek to initiate two loan requests in different geos simultaneously. This is completely unbeknownst to the customer, who gets an instant, no-hassle approval and can use their loan at the point of sale. The data from each transaction is then persisted in a large-scale, redundant cloud database for future analysis, to create behavioral models, or simply to let the customer know what they purchased in the past.

Ensure Fit for Purpose and Enterprise-Wide Adoption
Streaming and time-based reasoning applications are confronted with both simple and complex sets of challenges. Functional business requirements dictate the manner in which data must be processed, thereby guiding the evaluation and selection of the most suitable stream processing engine to meet your specific needs.

This paper described a number of capabilities that would address both simple and complex scenarios, while keeping in mind acceptable trade-offs. You don't want to over-engineer a solution, but you want to know that it can grow to support an evolving business. To support that growth, there are a number of technical and operational factors that are crucial to the decision-making process.

We also suggested that you take a broad perspective that also considers nonfunctional aspects, such as how your team can deliver on the solution's promise; how it integrates into your organization's security framework, operational processes, and support structure; and how it can scale up and down in line with business demand.

In summary, the guidance presented here will help ensure that you choose a stream processing engine that is both fit for purpose for the business challenge at hand and able to enjoy enterprise-wide adoption.

About Cloudera
Cloudera is the only true hybrid platform for data,
analytics, and AI. With 100x more data under
management than other cloud-only vendors,
Cloudera empowers global enterprises to transform
data of all types, on any public or private cloud,
into valuable, trusted insights. Our open data
lakehouse delivers scalable and secure data
management with portable cloud-native
analytics, enabling customers to bring GenAI
models to their data while maintaining privacy
and ensuring responsible, reliable AI deployments.
The world’s largest brands in financial services,
insurance, media, manufacturing, and government
rely on Cloudera to be able to use their data
to solve the impossible — today and in the future.

To learn more, visit Cloudera.com and follow us on LinkedIn and X. Cloudera and associated marks are trademarks or registered trademarks of Cloudera, Inc. All other company and product names may be trademarks of their respective owners.

Cloudera, Inc. | 5470 Great America Pkwy, Santa Clara, CA 95054 USA | cloudera.com

© 2025 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries.
All other trademarks are the property of their respective companies. Information is subject to change without notice. WP_011_V2 January 9, 2025
