
A BUYER'S GUIDE

Decoding the Top 4 Real-Time Data Platforms Powered by Apache Flink®

Considerations for selecting a platform for real-time ETL and stream processing at scale
Introduction
In recent years, the landscape of data infrastructure technology has
undergone a transformative shift as companies deploy new workloads
powered by real-time applications and services. This evolution has
resulted in a demand for instantaneous data processing and analytics
at scale across various business sectors. From improving process
automation, to personalizing customer experiences, to increasing the
efficiency of critical business applications, organizations are increasingly
harnessing the power of data to drive efficiency, innovation, and
competitive advantage.

Within this rapidly evolving ecosystem, navigating the myriad service providers and solution offerings can be daunting. To aid in this decision-making process, we present this buyer's guide, created to provide you with the information you need to select the right solution for your specific requirements. Whether you're a seasoned professional seeking to optimize and enhance your existing infrastructure or a newcomer exploring the realm of real-time data processing, this guide will serve as an indispensable resource on the journey toward the solution that best fits your needs.

The Rise of Apache Flink
Apache Spark and Apache Flink stand out as robust open-source frameworks for
processing streaming data, but they do so in fundamentally different ways due
in large part to their origins and primary intended use cases. Spark was originally
developed as a parallel data processing engine for at-rest data, while Flink was
developed as a streaming-first parallel data processing engine. While Spark has since added streaming capabilities and Flink has added batch capabilities, the depth of these capabilities still differs significantly.

Some important points of distinction between Spark and Flink:

Processing Model. Spark is first and foremost a batch processing engine, but has added Structured Streaming to support the processing of small batches of data at a fixed interval. These microbatches are often lower latency than true batch processing, but are still much higher latency than true stream processing systems. Structured Streaming has recently released a "continuous" mode of processing, but it is still marked as experimental and is only available on the enterprise platform. Flink is a streaming-first processing engine that natively processes each record as it arrives. This results in latencies often as low as milliseconds, making it suitable for online systems such as threat detection and response, fraud detection, location tracking, inventory management, and many others.

Note: Stateful stream processing is the ability for a stream processing engine to "remember" data across events. State must be kept consistent with input data for correct, reproducible processing.

Language and API Support. Both Flink and Spark support SQL, Java, Scala, and Python. Both also provide a layered API, offering different levels of flexibility/convenience tradeoffs to support stream processing use cases. However, the Spark API only supports a basic set of operations, whereas Flink provides a much richer set of lower-level processing primitives that gives users access to windowing, timers, and event state for handling late-arriving data, as the sketch below illustrates.
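As a minimal, illustrative sketch of those primitives (all class and variable names here are ours, not from any vendor's platform), the following Flink KeyedProcessFunction keeps a per-key count in managed state and registers a timer, with each record processed individually as it arrives:

// Illustrative only: per-key managed state plus a processing-time timer,
// two of the lower-level primitives Flink exposes beyond its SQL layer.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class CountPerKey extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> count; // state "remembered" across events

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out)
            throws Exception {
        // Each record is handled as it arrives; there is no micro-batch boundary.
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);
        // Emit a per-key summary one minute from now; a production job would
        // track and coalesce timers rather than register one per element.
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60_000L);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        out.collect(ctx.getCurrentKey() + " -> " + count.value());
    }
}

In a full job, this function would run downstream of a keyBy(...) on a DataStream, for example events.keyBy(event -> event).process(new CountPerKey()).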

Change Data Capture (CDC) Support. One of the many features that Spark Structured Streaming currently does not have is a built-in CDC connector framework. To use CDC, Spark must consume change data events captured via external tools (such as Debezium) that publish to Kafka, which Spark then processes. This method requires setting up an additional component to manage the CDC logic before the data gets ingested by Spark. Flink offers a robust set of native CDC connectors built on top of Debezium. Flink's CDC connectors can connect directly to databases and capture changes without the need for additional tools such as Kafka or Kafka Connect.

Note: Change Data Capture (CDC) is a commonly used technique for real-time data integration across different data systems.
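As a rough sketch of what this looks like in practice (host, database, table, and credentials below are all placeholders), a Flink job can read a MySQL binlog directly using the MySQL source from the Flink CDC connectors project:

// Sketch of a Flink CDC pipeline reading MySQL change events directly,
// without Kafka or Kafka Connect. All connection details are placeholders.
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MySqlCdcSketch {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("mysql.example.com") // placeholder host
                .port(3306)
                .databaseList("inventory")     // placeholder database
                .tableList("inventory.orders") // placeholder table
                .username("flink_user")        // placeholder credentials
                .password("********")
                // Emit each Debezium change event as a JSON string.
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing lets the source track binlog positions for recovery.
        env.enableCheckpointing(3000);
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
           .print(); // a real pipeline would transform and write to a sink
        env.execute("mysql-cdc-sketch");
    }
}

Package coordinates vary across Flink CDC releases (the project has since moved to Apache as flink-cdc), so treat the imports above as version-dependent.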

Consequently, Apache Flink has emerged as the de facto standard for stream
processing, gaining widespread adoption among some of the industry’s most prominent
innovators, including Alibaba, Uber, and Netflix. Continuously growing in adoption due
to its performance, fault tolerance, and scalability, Flink has become the backbone of
many real-time data processing pipelines.

The open-source nature of Flink offers unparalleled flexibility and innovation potential,
but it also presents a significant challenge for organizations: the need for a highly
skilled and specialized workforce to effectively use its capabilities. Building and
maintaining the infrastructure required to support real-time applications powered
by Flink demands a team of expert engineers proficient in areas such as distributed
systems, data engineering, and stream processing. It is also necessary to have
specialists to integrate other projects and their capabilities, such as Debezium to
support change data capture (CDC).

This poses a dilemma for many organizations, as the reality is that only a select few
possess the resources and expertise necessary to tackle the complexities of Flink
at scale. For the vast majority of companies, this represents a formidable barrier to
entry for supporting real-time data processing. In response to this challenge, many
companies turn to vendors and service providers who offer tailored solutions to simplify
the adoption and management of Flink-based applications. By using the expertise and
resources of these vendors, organizations can overcome the barriers posed by Flink’s
complexity and focus on using its capabilities to drive innovation and growth.

Selecting the Right Service Provider for Your Needs
In order to run a successful data infrastructure technology stack at scale in production, many component areas are required beyond the fundamental core open-source software provided by Flink and Debezium. Different service providers have created offerings that target a wide range of these needs, from the lowest infrastructure layer, which can be used to support a DIY approach, all the way up to a fully-managed platform offering, with multiple levels of cloud-hosted options in between.

Let’s explore these options along with what is offered at the different levels, what
remains for you and your team to create and manage after adopting each one, and
the pros and cons of each.

A Comprehensive Checklist

Successfully running a data infrastructure platform at scale in production requires the combined capabilities of many different layers of functionality that build on each other:

Cloud Infrastructure
• Fully hosted or "bring your own cloud"
• Available in multiple regions

Open Source Software
• Apache Flink
• Debezium

Platform Management and Control
• Resource management
• Job control
• Security engineering and access control
• State management
• Job optimization
• Platform and job observability
• Connector library, including CDC support
• Multi-stream connections
• Data catalog

Developer Experience
• Unified API
• Command line interface
• Web UI
• dbt adapter support
• CI/CD workflow capable

Language Support for Flink Jobs
• Create processing jobs using standard SQL
• Support for custom jobs written in Java
• PyFlink support for jobs written in Python
OPTION #1
Do It Yourself on Cloud Infrastructure
For companies that choose to pursue the DIY approach and build and run their own data platform from scratch, the journey most commonly starts with a cloud service provider to provision the infrastructure layer (IaaS).

At this level, you’re getting the raw compute, storage, and networking services, as
well as provisioning capabilities such as Kubernetes. The rest of the architecture,
including deploying Flink as well as building and managing all the other aspects of
your data platform, is entirely up to you.

To support your platform for running business-critical workloads in production, you'll need to have the right staffing in place for site reliability engineering (SRE) and platform engineering. In addition to keeping the infrastructure running and optimized, you'll need data platform specialists with expertise in Flink and Debezium to deploy, manage, and optimize these powerful but complex components.

Figure 1. Architectural components of a DIY approach to a real-time ETL and stream processing platform.

PROS:
• Flexibility: You have full control over your data platform and can customize it according to your specific needs.
• Cost Savings: You can potentially save costs by managing your own platform and underlying infrastructure efficiently.

CONS:
• Complexity: Building and maintaining a real-time data platform requires expertise in areas like SRE and data platform engineering.
• Cost Risk: Unforeseen issues can eclipse the initial cost savings achieved in the early stages of deployment.
• Responsibility: You're responsible for ensuring the reliability, security, and optimization of your platform.
• Resource Intensive: It requires dedicated staff and resources to manage the platform effectively.
• Support: Beyond the raw infrastructure, it's up to you to support Flink, Debezium, and any other open-source tools that you deploy.

OPTION #2
Flink-as-a-Service
Some cloud service providers offer a basic hosted Flink service. While this is
incredibly useful, it is not generally sufficient for the majority of customers and leaves
quite a number of challenges for teams to solve on their own. You are being asked to
handle the connectors, to make sure you get the versions right while keeping up with
security patches, and to configure many of the details of Flink. State management,
optimization, observability, security, and compliance—these are all tasks that you’ll
need to provision and resource a data platform engineering team to handle. In
addition, there is no tooling or developer experience support provided, which will
need to be created internally as needed.

Figure 2. Architectural components of a Flink-as-a-Service approach to a real-time ETL and stream processing platform.

PROS:
• Convenience: Hosting the Flink runtime reduces the burden of managing infrastructure, allowing you to give more focus to application development and other supporting components which are necessary for a production-ready platform.
• Access to Flink: You have access to the Flink engine and its native APIs, without the overhead of running it on your own.

CONS:
• Limited Support: Higher-level layers like connectors, security features, and optimization are left in your hands.
• Flink Expertise Required: You still need specialists to handle aspects like versioning, configuration, debugging production issues, and optimization.

OPTION #3
Managed Provider
Managed Flink providers go even higher up the data stack. In addition to Flink, they may
offer Debezium and other connectors, and they can provide services such as job control,
resource management, and security features.

Again, while these are good offerings, they still do not provide the complete picture that customers want and need in order to be successful. For instance, a managed provider might give you a schema registry, but you're still responsible for metadata beyond schemas, such as tracking the semantics of Kafka topics and understanding the difference between change data capture and append-only streams, as well as creating and maintaining your own data catalog. Managed providers may also lack features that many organizations would deem critical: you still need the ability to configure, maintain, and support the platform, to implement the integration between external systems, and to retain the flexibility to create custom Flink jobs in the language of your choice. And you may also need to be able to optimize processing jobs and scale resource availability up and down for your tasks and workloads to manage operational costs more efficiently.


Figure 3. Architectural components of a managed provider approach to a real-time ETL and stream
processing platform.

PROS:
• Additional Services: Managed providers offer services like job control, resource management, and security features, reducing the workload on your team.
• Streamlined Operations: They handle essential tasks, allowing your team to focus more on application development rather than infrastructure management.

CONS:
• Partial Solution: While they provide additional services, some critical aspects like data catalog maintenance and fine-tuning may still fall on your team.
• Developer Restrictions: The managed provider may only offer a limited set of options for creating Flink jobs, as opposed to being able to use your choice of SQL, Java, or Python as appropriate for your environment.

OPTION #4
Fully-Managed Real-Time Data Platform
The most complete solutions for real-time data are provided by those who can offer a fully-managed platform, one that integrates all the necessary components so that your stream processing jobs just work. In addition to Flink as the base layer, they provide the observability that production environments demand and the security features that protect your data, and they expose the controls you need to run your workloads efficiently, without burdening you with internal configuration settings that need endless fine-tuning to keep the platform running smoothly.

At this level, the focus is on implementing your business logic, whether that's in SQL, Java, or Python. No pre-existing Flink knowledge is assumed or required, but using it should be your choice: if you have Flink JAR files you need to run, they need to be fully supported. If you'd prefer to use SQL, you don't have to write any Java code at all. The developer experience also includes a web UI, a robust CLI tool, a dbt adapter, and a unified set of APIs for the entire platform so that it works seamlessly within your established ecosystem.

One litmus test for determining how quickly you will be able to get value out of any data movement and stream processing offering is to look at how much pre-existing Flink knowledge you really need in order to use the product. From our perspective, if you have Flink knowledge, that's great: it's a bonus, and you can take advantage of that knowledge and experience. But if you do not, the first step of your customer journey should not be learning about partitioned topics, managed state backends, and all the other internal details of using Flink effectively.

Figure 4. Architectural components of a fully-managed approach to a real-time ETL and stream processing platform.

Decodable Provides the Solution

Decodable is a real-time ETL and stream processing platform powered by Flink and Debezium. We've purposely built Decodable to meet the needs of two very different audiences, and our job is to be able to make both sets of people equally productive without having to get into the nuts and bolts of what it means to get the most out of Flink. The first audience includes "Flink Pros," or people who have been running and maintaining Flink clusters, building Flink jobs, and managing the entire environment on their own. These are people who want to have quite a bit of flexibility and control over how their jobs are implemented and executed. And then there are people who honestly don't care about the specific underlying technologies of our platform. They know SQL, and that's where they want to focus.

No matter which camp you're in, Decodable offers a simplified, unified approach to real-time ETL with a fully-managed, serverless platform that eliminates the complexity and overhead of managing infrastructure, allowing data engineering teams to focus on building pipelines using SQL, Java, or Python. With a wide range of connectors for seamless data movement between systems and powerful stream processing capabilities, Decodable enables the development of real-time pipelines with ease. The platform ensures operational reliability through built-in security, compliance, and a dedicated support team that acts as the first line of defense for your data pipelines.

Efficient, Secure, and Proven

Powered by Flink and Debezium, Decodable is able to deliver its services in an efficient and secure way, using trusted open-source projects that are well known to work in production and to support mission-critical applications. Our experience with these projects enables us to configure our platform to take full advantage of the underlying hardware, such as the AWS Graviton architecture.

Robust Connector Library

Like many other tech companies in the data space, we have a connector library. The primary difference with Decodable is the depth to which we go in building each and every connector in our library. Notably, the focus is on depth rather than breadth, because it's really about supporting these external data systems as thoroughly as possible. We strive to provide robust support for different systems, including capabilities such as change data capture (CDC), without requiring you to worry about the boundary between Flink and Debezium. Our connectors seamlessly handle that level of detail for you.

Automatic Configuration, Tuning, and Scaling

The Decodable platform also provides a rich set of automatic configuration, tuning, and scaling capabilities. While some of that is provided by the underlying technologies, a great deal of it comes from a team of seasoned experts who have been woken up in the middle of the night to address problematic settings in Flink or misconfigured jobs. For teams tasked with running Flink themselves, there is a whole class of less obvious challenges that require people who deeply understand how it works and how to debug it.

Our Cloud or Yours

There are two options for customers taking advantage of the Decodable platform.
The simplest way to get up and running quickly is our fully-managed offering, where
everything runs inside of Decodable’s cloud environment. For more complex or
compliance-constrained environments, we also support what we call “bring your own
cloud.” In this case, the data plane of the platform—the part that actually runs the
connections, processes the event streams, handles encryption keys, and isolates all of
the data—runs inside of your own VPCs.

Designed for All Developers

Working with the Decodable platform is very similar to working with AWS, Azure, or
other cloud services, with several options for interaction. There is a web-based UI
that you can use for data exploration and building processing jobs, first-class APIs
that expose all of the platform’s features and capabilities, a command line interface for
working directly in a terminal, support for declarative YAML configuration files for use
in CI/CD and version control workflows, and even a dbt adapter to integrate with that
ecosystem. We let you focus on implementing your business logic, whether that’s in
SQL, Java, or Python.

The Support You Need

Decodable offers multiple tiers of customer support, up to and including dedicated technical support for enterprise customers. For every customer, our team of experts is your first line of defense to ensure the platform is operating reliably and optimally. If there are node failures, or checkpoints are not keeping up with your jobs, we can take mitigating steps without getting you involved, provided the necessary changes do not alter the cost profile. We can also proactively reach out when we see something is wrong and suggest possible causes, which you can either look into on your own or resolve in collaboration with us.

Making the Case for Decodable
The use cases for data movement and real-time stream processing continue to
expand across every business sector. Here are some brief examples of how we’ve
been able to help our customers overcome the challenges they were facing and
achieve their goals.

Compliance Automation Provider

A customer offering compliance automation had an existing ETL solution that was failing, unable to handle their data volume and processing requirements. They needed to replicate transactional data from over 4,000 MySQL databases to Snowflake in real time. The latency of data replication with their previous approach exceeded 16 days.

Using our connectors for MySQL CDC and Snowflake Snowpipe Streaming, they were
able to transfer an initial snapshot followed by continuous real-time updates, reducing
latency from over two weeks down to seconds. With Decodable, they now have a
real-time data replication and stream processing solution that handles over 100,000
CDC streams while at the same time lowering their costs for both extracting data from
their source systems and loading into their destination systems.

MySQL (4,000 databases, 800,000 tables) → real-time processing + replication → Snowflake (real-time analytics)

Challenge: Data replication from MySQL was failing, with processing latencies exceeding 16 days.
Solution: Used Decodable's MySQL CDC and Snowflake Snowpipe Streaming connectors to implement continuous real-time updates.
Result: Achieved real-time data replication capable of handling over 100,000 CDC streams and reduced the total cost.

Customer Experience Platform

Another customer wanted to provide a better in-application experience for their customers. The current experience was suffering due to a lack of fresh data and timely updates, causing an increase in customer support calls and low customer satisfaction.

Decodable provided a fully-managed Flink service to process customer requests, route and prioritize tickets, provide real-time analytics, and enable historical reporting in Clickhouse. This resulted in increased NPS and customer retention, as well as a higher win rate against competitors due to category-leading features.

Live application ticket events → ticket prioritization and routing, real-time analytics → Clickhouse (historical reporting)

Challenge: In-application user experience was suffering due to a lack of fresh data and timely updates.
Solution: Decodable provided a fully-managed platform to process customer requests, route and prioritize tickets, and provide real-time analytics.
Result: Significantly improved end-user experience, driving a higher win rate and increased customer retention.

Recruitment Platform
A customer offering a recruitment platform was experiencing high latency in their
candidate searches, causing poor experiences for the recruiters working in their
platform. To increase customer satisfaction, they needed to provide much lower
search latencies and continuously process new data in real-time to improve the
quality of their search results.

Using our connectors for MySQL CDC and Elasticsearch, they were able to simplify
and optimize their data movement and real-time stream processing. This resulted in
reducing search times from tens of seconds to less than one second, which greatly
improved the customer experience and drove an increase in customer retention.

MySQL (continuously updated) → real-time replication → Elasticsearch (real-time search)

Challenge: Customer needed real-time data movement to drive lower-latency candidate search functionality.
Solution: Decodable provided optimized MySQL CDC and Elasticsearch connectors to enable continuous real-time updates.
Result: Reduced search time from tens of seconds per request to less than one second, dramatically improving customer experience.
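For illustration only (this is not the customer's actual code, and the host and index names are placeholders), pairing a CDC stream like the MySQL source sketched earlier with Flink's Elasticsearch sink looks roughly like this:

// Rough sketch: writing a stream of change events into Elasticsearch so that
// searches hit a continuously updated index. All names are placeholders.
import java.util.Collections;
import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder;
import org.apache.flink.connector.elasticsearch.sink.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

public class CandidateIndexSink {
    public static ElasticsearchSink<String> build() {
        return new Elasticsearch7SinkBuilder<String>()
                .setHosts(new HttpHost("es.example.com", 9200, "http")) // placeholder
                .setBulkFlushMaxActions(100) // batch writes; tune for throughput
                .setEmitter((element, context, indexer) ->
                        indexer.add(Requests.indexRequest()
                                .index("candidates") // placeholder index name
                                .source(Collections.singletonMap("event", element))))
                .build();
    }
    // Usage: a CDC-sourced DataStream<String> would call .sinkTo(build()).
}

This uses Flink's Elasticsearch 7 sink from flink-connector-elasticsearch7; the emitter shown wraps each change event in a single-field document purely for illustration.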

Summary
When it comes to real-time stream processing solutions, there are several different
levels of service being offered. These run the gamut from:

• Do It Yourself on Cloud Infrastructure. At this level, you're getting the raw compute, storage, and networking services, as well as provisioning capabilities such as Kubernetes. The rest of the architecture, including deploying Flink as well as building and managing all the other aspects of your data platform, is entirely up to you.

• Flink-as-a-Service. While useful, this is not generally sufficient for the majority
of customers and leaves quite a number of challenges for teams to solve on their
own. You are being asked to handle the connections to external data systems, to
make sure you get the versions right while keeping up with security patches, and to
configure many of the details of Flink.

• Managed Provider. In addition to Flink, managed providers may offer Debezium and other connectors, and they can provide services such as job control, resource management, and security features. Again, while these are good offerings, they still do not provide the complete picture that customers want and need in order to be successful.

• Fully-Managed Real-Time Data Platform. The most complete solutions for real-time data are provided by those who can offer a fully-managed platform, one that integrates all the necessary components so that your stream processing jobs just work. In addition to Flink as the base layer, they provide the observability that production environments demand and the security features that protect your data, and they expose the controls you need to run your workloads efficiently, allowing you to focus on implementing your business logic.

Decodable is a fully-managed real-time data platform powered by Flink and Debezium. We've built an enterprise-grade, production-ready offering that just works for running your business-critical ETL and stream processing jobs. We provide everything you need to be successful:

• A platform built on efficient, secure, and proven software
• A robust connector library, including CDC
• Automatic configuration, tuning, and scaling
• The option to run in your cloud or ours
• An experience designed for all developers
• The expert support you need

A Side-by-Side Comparison

Comparing the four options across the capabilities from the checklist:

• DIY on Cloud Infrastructure: provides the cloud infrastructure layer only. Everything above it, from Apache Flink itself to platform management and developer tooling, is up to you.

• Flink-as-a-Service: provides cloud infrastructure and Apache Flink. Connectors, platform management and control, and the developer experience remain your responsibility.

• Managed Provider: provides cloud infrastructure, Apache Flink, state management, security engineering, resource management, and job control, with partial support for a connector library (including CDC), platform and job observability, and a data catalog. Customer business logic in SQL is supported, but specific providers may limit language support for Java and Python.

• Fully-Managed Platform: provides the complete checklist: customer business logic in SQL, Java, and Python, plus a web UI, CLI, dbt adapter, unified API, job tuning and optimization, data catalog, platform and job observability, connector library including CDC, job control, resource management, security engineering, state management, Apache Flink, and cloud infrastructure.
Join our upcoming tech talk

12PM ET, June 18, 2024 (Virtual)
About Decodable
At Decodable, our mission is to redefine real-time ETL, eliminating
the complexity that consumes data teams. We believe accessing
timely, high-quality data should be effortless, not a constant uphill
battle. That's why we've built a powerful yet easy-to-use real-time
ETL platform that absorbs the challenges of building and maintaining
pipelines. With Decodable, data teams can easily connect data
sources, perform real-time transformations, and deliver reliable data to
any destination.

Our developer-friendly platform eliminates infrastructure overhead while keeping pipelines running smoothly and ensuring data quality and compliance, so you can focus on driving toward your business goals. Whether you're modernizing your data stack or building cutting-edge applications, Decodable empowers you with a world-class data engineering experience.

© 2024 Decodable. Apache®, Apache Flink, Flink®, Apache Kafka, Kafka®, Apache Spark, Spark®, and the Flink
squirrel logo are either registered trademarks or trademarks of The Apache Software Foundation.
