Buyer’s Guide - Decoding the Top 4 Real-Time Data Platforms Powered by Apache Flink
2 | A Buyer’s Guide > Decoding the Top 4 Real-Time Data Platforms Powered by Apache Flink
The Rise of Apache Flink
Apache Spark and Apache Flink stand out as robust open-source frameworks for
processing streaming data, but they do so in fundamentally different ways due
in large part to their origins and primary intended use cases. Spark was originally
developed as a parallel data processing engine for at-rest data, while Flink was
developed as a streaming-first parallel data processing engine. While Spark has
developed streaming capabilities and Flink has developed batch capabilities, the
depth of these capabilities still differs significantly.
Language and API Support. Both Flink and Spark support SQL, Java,
Scala, and Python. Both also provide a layered API, offering different levels
of flexibility/convenience tradeoffs to support stream processing use
cases. However, the Spark API only supports a basic set of operations,
whereas Flink provides a much richer set of lower-level processing
primitives that give users access to window punctuation, timers, and
event state for handling late-arriving data.
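To make those primitives concrete, here is a toy sketch in plain Python (not the Flink API) of a tumbling event-time window with a watermark and an allowed-lateness grace period for late-arriving events. All names and window sizes are illustrative:

```python
from collections import defaultdict

# Toy illustration of the event-time ideas discussed above: tumbling
# windows, a watermark, and retained window state so that late-arriving
# events can still be counted within an allowed-lateness horizon.
WINDOW_SIZE = 10      # window length in event-time units (illustrative)
ALLOWED_LATENESS = 5  # how long window state is kept past the watermark

def assign_window(event_time):
    """Map an event timestamp to the start of its tumbling window."""
    return (event_time // WINDOW_SIZE) * WINDOW_SIZE

def process(events):
    """events: iterable of (event_time, value) pairs, possibly out of order.
    Returns counts per window, dropping events that arrive after their
    window's allowed-lateness grace period has passed."""
    windows = defaultdict(int)  # per-window retained state
    watermark = 0               # stand-in for a real watermark mechanism
    for event_time, _value in events:
        watermark = max(watermark, event_time)
        window_start = assign_window(event_time)
        # A late event is still accepted while its window remains inside
        # the allowed-lateness horizon; beyond that it is discarded.
        if window_start + WINDOW_SIZE + ALLOWED_LATENESS > watermark:
            windows[window_start] += 1
    return dict(windows)

# Event at time 8 arrives late (after time 12) but is still counted;
# the event at time 2 arrives too late (after time 25) and is dropped.
counts = process([(1, "a"), (12, "b"), (8, "c"), (25, "d"), (2, "e")])
```

In real Flink, watermarks, window state, and lateness handling are managed by the runtime; the sketch only shows why having direct access to those concepts matters.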
Consequently, Apache Flink has emerged as the de facto standard for stream
processing, gaining widespread adoption among some of the industry’s most prominent
innovators, including Alibaba, Uber, and Netflix. Continuously growing in adoption due
to its performance, fault tolerance, and scalability, Flink has become the backbone of
many real-time data processing pipelines.
The open-source nature of Flink offers unparalleled flexibility and innovation potential,
but it also presents a significant challenge for organizations: the need for a highly
skilled and specialized workforce to effectively use its capabilities. Building and
maintaining the infrastructure required to support real-time applications powered
by Flink demands a team of expert engineers proficient in areas such as distributed
systems, data engineering, and stream processing. It is also necessary to have
specialists to integrate other projects and their capabilities, such as Debezium to
support change data capture (CDC).
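For a sense of what CDC integration involves, here is a minimal plain-Python sketch of applying Debezium-style change events (a simplified version of the "op"/"before"/"after" envelope Debezium produces) to a local view of a table. The event shapes and values are illustrative:

```python
# Minimal sketch of consuming Debezium-style change events and applying
# them to a local, in-memory view of a table keyed by primary key.
def apply_change(table, event):
    """table: dict keyed by primary key. event: simplified Debezium-style
    envelope with "op" plus "before"/"after" row images."""
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                # delete: only "before" is populated
        table.pop(event["before"]["id"], None)
    return table

view = {}
apply_change(view, {"op": "c", "after": {"id": 1, "name": "Ada"}})
apply_change(view, {"op": "u", "before": {"id": 1, "name": "Ada"},
                    "after": {"id": 1, "name": "Ada L."}})
apply_change(view, {"op": "c", "after": {"id": 2, "name": "Grace"}})
apply_change(view, {"op": "d", "before": {"id": 1, "name": "Ada L."}})
# view now contains only row 2
```

A production deployment also has to handle snapshots, schema changes, ordering guarantees, and delivery semantics, which is why this integration work calls for specialists.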
This poses a dilemma for many organizations, as the reality is that only a select few
possess the resources and expertise necessary to tackle the complexities of Flink
at scale. For the vast majority of companies, this represents a formidable barrier to
entry for supporting real-time data processing. In response to this challenge, many
companies turn to vendors and service providers who offer tailored solutions to simplify
the adoption and management of Flink-based applications. By using the expertise and
resources of these vendors, organizations can overcome the barriers posed by Flink’s
complexity and focus on using its capabilities to drive innovation and growth.
Let’s explore these options along with what is offered at the different levels, what
remains for you and your team to create and manage after adopting each one, and
the pros and cons of each.
A Comprehensive Checklist
Successfully running a data infrastructure platform at scale in production requires the combined
capabilities of many different layers of functionality that build on each other:
• Cloud Infrastructure: fully hosted or “bring your own cloud”
• Debezium
• Job control
• State management
• Job optimization
• Multi-stream connections
• Data catalog
• Developer Experience: unified API and web UI
OPTION #1
Do It Yourself on Cloud Infrastructure
For companies who choose to pursue the DIY approach and build and run their own
data platform from scratch, this most commonly starts with a cloud service provider
to provision the infrastructure layer (IaaS).
At this level, you’re getting the raw compute, storage, and networking services, as
well as provisioning capabilities such as Kubernetes. The rest of the architecture,
including deploying Flink as well as building and managing all the other aspects of
your data platform, is entirely up to you.
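As one concrete example of what “entirely up to you” means, here is a hedged sketch of deploying Flink on Kubernetes using the Flink Kubernetes Operator’s FlinkDeployment resource. Every value shown (versions, resource sizes, the example JAR path) is an illustrative placeholder, not a recommendation:

```yaml
# Illustrative FlinkDeployment resource for the Flink Kubernetes Operator.
# In a DIY setup, every value below is yours to choose, size, and keep
# patched over time.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-pipeline
spec:
  image: flink:1.17            # image selection and security patching are on you
  flinkVersion: v1_17
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless     # savepoint-based upgrade strategies are also on you
```

And this covers only the Flink deployment itself; connectors, state backends, observability, security, and upgrades each bring their own configuration surface.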
Figure 1. Architectural components of a DIY approach to a real-time ETL and stream processing platform.
PROS:
• Flexibility: You have full control over your data platform and can customize it according to your specific needs.
• Cost Savings: You can potentially save costs by managing your own platform and underlying infrastructure efficiently.

CONS:
• Complexity: Building and maintaining a real-time data platform requires expertise in areas like SRE and data platform engineering.
• Cost Risk: Unforeseen issues can eclipse the initial cost savings achieved in the early stages of deployment.
• Responsibility: You’re responsible for ensuring the reliability, security, and optimization of your platform.
• Resource Intensive: It requires dedicated staff and resources to manage the platform effectively.
• Support: Beyond the raw infrastructure, it’s up to you to support Flink, Debezium, and any other open-source tools that you deploy.
OPTION #2
Flink-as-a-Service
Some cloud service providers offer a basic hosted Flink service. While this is
incredibly useful, it is not generally sufficient for the majority of customers and leaves
quite a number of challenges for teams to solve on their own. You are being asked to
handle the connectors, to make sure you get the versions right while keeping up with
security patches, and to configure many of the details of Flink. State management,
optimization, observability, security, and compliance—these are all tasks that you’ll
need to provision and resource a data platform engineering team to handle. In
addition, there is no tooling or developer experience support provided, which will
need to be created internally as needed.
OPTION #3
Managed Provider
Managed Flink providers go even higher up the data stack. In addition to Flink, they may
offer Debezium and other connectors, and they can provide services such as job control,
resource management, and security features.
Again, while these are good offerings, they still do not provide the complete picture that
customers want and need in order to be successful. For instance, a managed provider
might give you a schema registry, but you’re still responsible for metadata beyond schemas,
such as tracking the semantics of Kafka topics and understanding the difference between
change data capture and append-only streams, as well as creating and maintaining your
own data catalog. Managed providers may also lack features that many organizations would
deem critical: the ability to configure, maintain, and support the platform; integration with
external systems; and the flexibility to create custom Flink jobs in the language of your
choice. And you may also need to be able to optimize processing jobs and scale resource
availability up and down for your tasks and workloads to manage operational costs more
efficiently.
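The distinction called out above is easy to illustrate: interpreting the same records as an append-only stream versus a CDC changelog produces different answers. This is a plain-Python sketch with made-up records, not a Decodable or Flink API:

```python
# The same sequence of records means different things depending on the
# stream's semantics: an append-only event log versus a CDC changelog
# keyed by primary key.
records = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 1, "amount": 15},  # in a changelog: an update to row 1
]

def total_append_only(records):
    """Append-only semantics: every record is a new, independent fact."""
    return sum(r["amount"] for r in records)

def total_changelog(records):
    """Changelog semantics: later records replace earlier ones with the
    same key, so only the latest value per key counts."""
    latest = {}
    for r in records:
        latest[r["id"]] = r["amount"]
    return sum(latest.values())

append_total = total_append_only(records)   # counts all three records
changelog_total = total_changelog(records)  # counts the latest value per key
```

Getting this wrong silently corrupts downstream aggregates, which is why tracking stream semantics belongs in the platform rather than in tribal knowledge.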
Figure 3. Architectural components of a managed provider approach to a real-time ETL and stream
processing platform.
PROS:
• Additional Services: Managed providers offer services like job control, resource management, and security features, reducing the workload on your team.
• Streamlined Operations: They handle essential tasks, allowing your team to focus more on application development rather than infrastructure management.

CONS:
• Partial Solution: While they provide additional services, some critical aspects like data catalog maintenance and fine-tuning may still fall on your team.
• Developer Restrictions: The managed provider may only offer a limited set of options for creating Flink jobs, as opposed to being able to use your choice of SQL, Java, or Python as appropriate for your environment.
OPTION #4
Fully-Managed Real-Time Data Platform
The most complete solutions for real-time data are provided by those who can offer
a fully-managed platform, one that integrates all the necessary components so that
your stream processing jobs just work. In addition to Flink as the base layer, they
provide the observability that production environments demand and the security
features that protect your data, and they expose the controls you need to run your
workloads efficiently, without burdening you with internal configuration settings
that need endless fine-tuning to keep the platform running smoothly.
At this level, the focus is on implementing your business logic, whether that’s in SQL,
Java, or Python. No pre-existing Flink knowledge is assumed or required, but using
it should be your choice: if you do have Flink JAR files you need to run, they need to
be fully supported. If you’d prefer to use SQL, you don’t have to write any Java code at
all. The developer experience also includes a web UI, a robust CLI tool, a dbt adapter,
and a unified set of APIs for the entire platform so that it works seamlessly within your
established ecosystem.
One litmus test for determining how quickly you will be able to get value out of any data
movement and stream processing offering is to look at how much pre-existing Flink
knowledge you really need in order to use the product. From our perspective, if you
have Flink knowledge, then that’s great—it’s a bonus. You can take advantage of that
knowledge and experience. At the same time, if you do not have that knowledge, then
the first step of your customer journey should not be learning about topic partitions,
managed state backends, and all the internal details of using Flink effectively.
Decodable Provides
the Solution
Decodable is a real-time ETL and stream processing platform powered by Flink and
Debezium. We’ve purposely built Decodable to meet the needs of two very different
audiences, and our job is to make both sets of people equally productive without
having to get into the nuts and bolts of what it means to get the most out of Flink.
The first audience includes “Flink Pros,” people who have been running and
maintaining Flink clusters, building Flink jobs, and managing the entire environment
on their own. These are people who want quite a bit of flexibility and control over how
their jobs are implemented and executed. And then there are people who honestly
don’t care about the specific underlying technologies of our platform. They know
SQL, and that’s where they want to focus.

No matter which camp you’re in, Decodable offers a simplified, unified approach to
real-time ETL with a fully-managed, serverless platform that eliminates the complexity
and overhead of managing infrastructure, allowing data engineering teams to focus
on building pipelines using SQL, Java, or Python. With a wide range of connectors for
seamless data movement between systems and powerful stream processing
capabilities, Decodable enables the development of real-time pipelines with ease.
The platform ensures operational reliability through built-in security, compliance, and
a dedicated support team that acts as the first line of defense for your data pipelines.
each and every connector in our library. Notably, the focus is on depth rather
than breadth, because it’s really about supporting these external data systems as
thoroughly as possible. We strive to provide robust support for different systems,
including capabilities such as change data capture (CDC), without requiring that you
need to worry about the boundary between Flink and Debezium. Our connectors
seamlessly handle that level of detail for you.
Making the Case
for Decodable
The use cases for data movement and real-time stream processing continue to
expand across every business sector. Here are some brief examples of how we’ve
been able to help our customers overcome the challenges they were facing and
achieve their goals.
Using our connectors for MySQL CDC and Snowflake Snowpipe Streaming, they were
able to transfer an initial snapshot followed by continuous real-time updates, reducing
latency from over two weeks down to seconds. With Decodable, they now have a
real-time data replication and stream processing solution that handles over 100,000
CDC streams while at the same time lowering their costs for both extracting data from
their source systems and loading into their destination systems.
• Challenge: Data replication from MySQL was failing, with processing latencies exceeding 16 days.
• Solution: Used Decodable’s MySQL CDC and Snowflake Snowpipe Streaming connectors to implement continuous real-time updates.
• Outcome: Achieved real-time data replication capable of handling over 100,000 CDC streams and reduced the total cost.
Figure: Ticket events flow from a live application into Clickhouse, supporting historical reporting, ticket prioritization and routing, and real-time analytics.
Recruitment Platform
A customer offering a recruitment platform was experiencing high latency in their
candidate searches, causing poor experiences for the recruiters working in their
platform. To increase customer satisfaction, they needed to provide much lower
search latencies and continuously process new data in real-time to improve the
quality of their search results.
Using our connectors for MySQL CDC and Elasticsearch, they were able to simplify
and optimize their data movement and real-time stream processing. This resulted in
reducing search times from tens of seconds to less than one second, which greatly
improved the customer experience and drove an increase in customer retention.
Figure: Data movement from MySQL to Elasticsearch.
Summary
When it comes to real-time stream processing solutions, there are several different
levels of service being offered, running the gamut from do-it-yourself on cloud
infrastructure, through Flink-as-a-Service and managed providers, to fully-managed
real-time data platforms. For example:
• Flink-as-a-Service. While useful, this is not generally sufficient for the majority
of customers and leaves quite a number of challenges for teams to solve on their
own. You are being asked to handle the connections to external data systems, to
make sure you get the versions right while keeping up with security patches, and to
configure many of the details of Flink.
A Side-by-Side
Comparison
Table: A side-by-side comparison across DIY on Cloud Infrastructure, Flink-as-a-Service, Managed Provider, and Fully-Managed Platform. The rows cover the customer’s business logic (SQL, Java, and Python, noting that specific managed providers may limit language support) and the provider’s offering: Web UI, CLI, dbt Adapter, Unified API, Connector Library including CDC (partial in some offerings), Job Control, Resource Management, Security Engineering, State Management, Apache Flink, and Cloud Infrastructure.
Join our upcoming tech talk
About Decodable
At Decodable, our mission is to redefine real-time ETL, eliminating
the complexity that consumes data teams. We believe accessing
timely, high-quality data should be effortless, not a constant uphill
battle. That’s why we’ve built a powerful yet easy-to-use real-time
ETL platform that absorbs the challenges of building and maintaining
pipelines. With Decodable, data teams can easily connect data
sources, perform real-time transformations, and deliver reliable data to
any destination.
© 2024 Decodable. Apache®, Apache Flink, Flink®, Apache Kafka, Kafka®, Apache Spark, Spark®, and the Flink
squirrel logo are either registered trademarks or trademarks of The Apache Software Foundation.