
Kafka in the Cloud
Why It's 10x Better With Confluent

Contents

Introduction: Better than Kafka
Section 1: Elasticity
Section 2: Storage
Section 3: Resiliency
Better than Kafka

With cloud becoming the new normal for modern IT infrastructure, people often imagine that bringing open source software like Apache Kafka® to the cloud is a simple matter of packaging up the software and putting it in Kubernetes on some public cloud instances. In reality, it's much harder than that.

With 97% of organizations around the world tapping into real-time data streams, data streaming has become imperative for businesses to thrive in today's dynamic, digital-first landscape. And Apache Kafka®, adopted by over 70% of the Fortune 500, has become the de facto standard for data streaming.

Indeed, Kafka is built to be easy to get started with. A reasonably skilled team can easily manage one cluster well enough in the single environment in which it is installed and operated. However, doing so doesn't unlock the true value of cloud and the reliability, scalability, and elasticity that modern cloud infrastructure can provide. Moreover, as your usage and use cases grow, the operational challenges of self-managing a complicated distributed system in the cloud only grow exponentially. For example:

1. Complex sizing, provisioning, and load balancing when expanding and shrinking your infrastructure to meet fluctuating customer demand

2. Carefully planning and provisioning storage limits, constantly throttling tenants, and expiring data for clusters to ensure your retained data doesn't exceed your broker disk capacity

3. Diverting engineering resources to design and configure your Kafka resiliency policy, monitor and address unplanned downtime or breaches, and conduct manual upgrades or patching to keep Kafka up and running

[Figure: A typical self-managed Kafka deployment, with a load balancer (LB) in front of Broker 0, Broker 1, and Broker 2.]

To truly realize the value of the cloud and focus your resources on business growth, you need a fully managed cloud-native service that abstracts these operational complexities away for you. In short, you need a cloud service for Kafka that takes limited data and infrastructure capabilities and transforms them into highly available shared resources that teams can use as much or as little as needed—whenever they want.

The ideal solution will:

- Operate seamlessly in a massive multi-tenant architecture
- Run highly optimized on a limited, but heavily tuned, set of cloud environments
- Scale resources elastically to meet fluctuating demand
- Apply automatic self-protective limits on every operation
- Adapt to highly automated, data-driven operations
Enter Confluent Cloud: A truly cloud-native service that is 10x better

Confluent Cloud allows us to harness the full power of the cloud and provide a Kafka service that is substantially better than Kafka alone. In fact, across a number of performance metrics, Confluent Cloud is now 10x better than self-managed open source Kafka or semi-managed services. Confluent Cloud offers Apache Kafka's protocol and is 100% compatible with the open source ecosystem. And since it's purpose-built for the cloud, virtually every layer in the stack has been transformed: from how data is routed over the network, how requests are processed, how data is stored, and where data is placed and when it is moved, to how all of this is controlled and observed at scale.

In this ebook, we'll explore three specific areas where Confluent Cloud has re-architected Apache Kafka to be 10x better: elasticity, storage, and resiliency. Along the way, we'll discuss how we achieved these improvements and the benefits your teams stand to gain from them.

Confluent Cloud is the only truly fully managed, cloud-native service for Apache Kafka. Over the last five years, we've poured more than 3 million engineering hours into building a Kafka service that is:

Cloud Native
We've completely re-architected Kafka for the cloud to be elastically scalable and globally available—providing a serverless, cost-effective, and fully managed service ready to deploy, operate, and scale in a matter of minutes.

Complete
Confluent completes Kafka with 120+ connectors, stream processing, enterprise security and governance, global resilience, and more—eliminating the burden and risk of building and maintaining these capabilities in-house.

Everywhere
Whether in the cloud, across multiple clouds, or on-prem, Confluent has you covered—plus, you can seamlessly link it all together in real time to create a consistent data layer across your business.
Section 1: Elasticity

Scale to handle GBps+ workloads and peak customer demands 10x faster and easier

Elasticity is the cornerstone of cloud-native computing. It allows businesses to scale quickly, add resiliency to a system, and make products more cost effective.

Businesses no longer want to spend their time carefully planning and provisioning storage limits, constantly throttling tenants, and expiring data for clusters. They expect their data streaming platform to scale up and down rapidly and easily to meet customer demand. Take our customer Instacart, for example. When Instacart became the essential way to get groceries during the pandemic, the grocery delivery startup adopted Confluent to dramatically scale its data systems to quickly serve over 500,000 new customers.

To meet the demands of these modern data pipelines, we took Apache Kafka's horizontal scalability to the next level and made Confluent Cloud 10x more elastic. To understand how, let's first dive into how scaling happens within Apache Kafka.

Let's explore an example of an OSS Kafka cluster with three brokers [0, 1, 2]. The cluster retains messages for 30 days and receives 100 MBps on average from its producers. Given a replication factor of three, the cluster will retain up to 777.6 TB of data, or about 259.2 TB per broker with equal balance.

Now when we scale the cluster by one broker, we need to bring the new machine up to speed. Assuming perfect data balancing, the new broker will be storing 194.4 TB (¾ of 259.2 TB), which will need to be read from the existing brokers. Even assuming the full network bandwidth were available for replication, it could take 43 hours using open source Kafka.

To solve the elasticity challenge, Confluent Cloud developed Intelligent Storage, which uses multiple layers of cloud storage and workload heuristics to make data rebalancing and movement even faster. Brokers in Confluent Cloud keep most of their data in object storage and retain only a small fraction of the data on local disks. Less data on the brokers means less data movement and, therefore, faster recovery when failures happen.

[Figure: Confluent Intelligent Storage. Each broker keeps a hot set on local storage; long-term data is offloaded to object storage. Multi-layer storage and workload heuristics allow less data stored on brokers and auto-rebalancing.]

So, when you scale up the cluster in Confluent Cloud by adding a broker, you don't have to move nearly as much data around—just the data that is on the physical brokers. The actual ratio of data in object storage vs. the brokers is dynamic, but in this scenario, let's assume that one day's worth of data sits on the brokers and the rest sits in object storage. That's a 1:30 ratio. In the example above, each broker would then keep 8.6 TB locally and 251.6 TB in object storage, and we only need to move 6.5 TB in total (¾ of 8.6 TB).

In Confluent Cloud, the scaling operation is kicked off immediately after you prompt the system for more resources. The whole operation is online, which means no downtime, and with Confluent Cloud's self-balancing mechanism, partition data starts moving to the new machines right away. So when you scale the cluster on a 10-gigabit network, the complete scaling operation will take only 1.4 hours in our example. Compared to the 43 hours with OSS Kafka, that's up to 30x faster!
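To make the sizing concrete, here is a minimal Python sketch of the arithmetic above. The 10 Gb/s replication bandwidth and the 1:30 hot-set ratio are the assumptions stated in the example, not fixed properties of either system.

```python
# Back-of-the-envelope sizing for the rebalancing example above.
# Assumptions (from the example): 100 MB/s ingress, 30-day retention,
# replication factor 3, 10 Gb/s of bandwidth available for replication.

SECONDS_PER_DAY = 86_400
ingress_mb_s = 100
retention_days = 30
replication_factor = 3

# Total retained data across the cluster, in TB (1 TB = 1e6 MB here).
total_tb = ingress_mb_s * SECONDS_PER_DAY * retention_days * replication_factor / 1e6
per_broker_tb = total_tb / 3                 # 259.2 TB on each of 3 brokers

# OSS Kafka: a 4th broker must take 1/4 of all data, i.e. 3/4 of one
# existing broker's share.
moved_oss_tb = per_broker_tb * 3 / 4         # 194.4 TB
bandwidth_tb_h = 10 / 8 * 3600 / 1e3         # 10 Gb/s -> 4.5 TB/hour
print(f"OSS Kafka: move {moved_oss_tb:.1f} TB, "
      f"~{moved_oss_tb / bandwidth_tb_h:.0f} hours")    # ~43 hours

# Confluent Cloud: only ~1 day of the 30-day retention sits on brokers (1:30).
hot_set_tb = per_broker_tb / retention_days  # ~8.6 TB local per broker
moved_cc_tb = hot_set_tb * 3 / 4             # ~6.5 TB to the new broker
print(f"Confluent Cloud: move {moved_cc_tb:.1f} TB, "
      f"~{moved_cc_tb / bandwidth_tb_h:.1f} hours")     # ~1.4 hours
```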

Scaling up and down instantly with Confluent Cloud

Elastic scaling applies when sizing both up and down. Once the holiday rush is over, you don't want that highly provisioned cluster sticking around costing you money. However, teams working with Apache Kafka have historically had very limited, time-consuming options for how to do this. Capacity adjustments require a complex process of sizing and provisioning new clusters, networking setup, traffic balancing across new brokers and partitions, and much more. Too often, the manual effort isn't worth the savings of running a smaller cluster.

With Confluent Cloud, you can shrink clusters just as fast as you can expand them. Our Basic and Standard offerings allow instant auto-scaling and can scale down to zero with no pre-configuration. With our Dedicated clusters, you can scale down a cluster by just moving the CKU slider.
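The CKU slider is also exposed programmatically. As a rough illustration only, a Dedicated cluster resize might look like the following sketch against Confluent Cloud's cluster-management (cmk/v2) REST API; the cluster ID, environment ID, and credentials are placeholders, and the exact request schema should be verified against the current API reference.

```python
# Illustrative sketch: shrink a Dedicated Confluent Cloud cluster by lowering
# its CKU count via the cluster-management REST API (cmk/v2). All IDs and
# credentials below are placeholders; check the API docs for the exact schema.
import requests

API = "https://api.confluent.cloud"
CLUSTER_ID = "lkc-example"          # hypothetical cluster ID
ENV_ID = "env-example"              # hypothetical environment ID
AUTH = ("API_KEY", "API_SECRET")    # a Cloud API key, not a Kafka API key

resp = requests.patch(
    f"{API}/cmk/v2/clusters/{CLUSTER_ID}",
    auth=AUTH,
    json={
        "spec": {
            "config": {"kind": "Dedicated", "cku": 2},  # target size: 2 CKUs
            "environment": {"id": ENV_ID},
        }
    },
)
resp.raise_for_status()
print(resp.json()["status"])        # the resize proceeds online, no downtime
```

The same operation is a one-liner in the Confluent CLI (something like `confluent kafka cluster update <cluster-id> --cku 2`); either way, the resize runs while the cluster keeps serving traffic.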
As validated in a recent study by GigaOm, Confluent Cloud completely manages scaling operations and therefore requires zero time or interaction to manage the size of the infrastructure. By contrast, Apache Kafka requires considerable time and effort (47 story points) for networking, load balancing, and so on. To learn more about the effort it would take to resize a cluster with self-managed Kafka, check out a live discussion and demo, and read the blog post to learn more about how we made Confluent Cloud 10x more elastic.

Best of all, the 10x faster scaling happens behind the curtain of our fully managed offering, making scaling 10x easier for you.

How 10x elasticity can reduce your total cost of ownership

- Avoid over-provisioning: With Confluent Cloud, you can scale your Kafka clusters right before the traffic hits. And with the ability to shrink fast, you avoid overpaying for any excess capacity when traffic slows down.

- Enable a faster, less expensive service: Because Confluent Cloud takes advantage of a blend of object storage and faster disks in an intelligent manner, you get a lower-latency service, at a competitive price, that's billed only when you use it.

- Remove cycles spent on infrastructure: Since scaling is 10x easier without operational burdens, you can reallocate your engineering resources to build something that differentiates your business.

"Elasticity is really important at AMPEERS. As we work with new customers, we don't know how many data flows there are in each second, so we have to be ready for anything. Auto-scaling clusters easily, without having to push the buttons and make changes, alleviates work on our end and makes it easy."

Lucas Recknagel
Chief Technology Officer, AMPEERS Energy

[Figure: Actual throughput vs. provisioned capacity over time, comparing Apache Kafka capacity planning with Confluent Cloud CKU capacity. With faster scaling up and down, businesses save on TCO by avoiding wasted capacity from over-provisioning.]
Section 2: Storage

Never worry about Kafka retention limits again with Infinite Storage

Keeping real-time and historical data in Apache Kafka allows for more advanced use cases, faster decision making, and better compliance with data retention requirements. However, there is a practical limit to how much you can store on a single Apache Kafka broker.

Because storage and compute are tied together, you have to provision additional brokers and pay for more resources when you hit that limit. This makes retaining the right amount of data in Kafka operationally complex and expensive—with operators having to constantly throttle tenants to monitor storage limits, expire data for clusters that reach capacity, and negotiate retention times with application teams to maintain cluster uptime and cut costs.

Thankfully, we've solved these challenges by building Infinite Storage on Confluent Cloud, so our users never have to worry about Kafka storage limits again. Infinite Storage starts with the separation of compute and storage resources. This allows us to store a subset of "hot" data locally on the broker while offloading the majority of data to the respective cloud provider's object storage tier.
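On a cluster with Infinite Storage, retention becomes just a topic setting. As a hedged sketch (the topic name and connection details are placeholders), unbounded retention can be requested by setting `retention.ms` to `-1`, using the standard Kafka admin API:

```python
# Sketch: lift time-based retention on an existing topic by setting
# retention.ms = -1 (no time limit). Topic name and credentials are
# placeholders for a Confluent Cloud cluster with Infinite Storage.
# Note: alter_configs is non-incremental; this is illustrative only.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({
    "bootstrap.servers": "pkc-example.us-west-2.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "API_KEY",      # placeholder credentials
    "sasl.password": "API_SECRET",
})

resource = ConfigResource(ConfigResource.Type.TOPIC, "orders",
                          set_config={"retention.ms": "-1"})
for res, future in admin.alter_configs([resource]).items():
    future.result()                  # raises if the broker rejected the change
print("retention.ms set to -1 (unbounded) for topic 'orders'")
```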
Powerful use cases are enabled when Kafka's storage limit is lifted:

- System of record: Kafka is often the first place data lands across systems. With no retention limits, Infinite Storage helps establish Confluent Cloud as a source of truth in an organization, providing a persistent and authoritative view of data that any user or system can immediately access.

- Real-time application and analysis with historical data: With Confluent Cloud as a system of record, businesses have access to real-time and historical events from any time period. This powers all sorts of use cases—from data reprocessing with all historic data, to rebuilding history from various systems, to machine learning (ML). Check out a live ML demo leveraging Infinite Storage.

- Meeting data retention compliance requirements: For example, financial institutions are required to retain data for seven years. During audits, companies usually create a new application just to surface data from this time period. It's infinitely simpler to read this data from an existing Kafka log than to reload it from various data sources.

To illustrate why the separation of storage and compute matters to resource utilization, imagine a use case that requires us to produce at a sustained rate of 150 MB/s to a topic with the default 7-day retention. This means that by day seven, you would have about 272 TB of data. In the image below, the area shaded in blue represents the throughput actually required to satisfy these requirements, while the area in red represents the extra, unused compute resources you would have to provision to satisfy storage requirements when operating open source Kafka, since you can only attach a limited amount of storage to a single broker.

With Confluent Cloud, because storage and compute are separated, storage automatically scales as you need it, without limits on retention time. It provides a proper cloud consumption model that allows users to store as much data as they want, for as long as they need, while only paying for the storage used. As a result, you'll be able to scale your storage infinitely better than with Apache Kafka.
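A quick sanity check of that 272 TB figure, as a minimal sketch; the replication factor of three is an assumption carried over from the earlier elasticity example.

```python
# Sketch: how much data a sustained 150 MB/s produce rate accumulates over
# 7 days of retention, assuming replication factor 3 as in the earlier example.
SECONDS_PER_DAY = 86_400
ingress_mb_s = 150
retention_days = 7
replication_factor = 3

stored_tb = (ingress_mb_s * SECONDS_PER_DAY * retention_days
             * replication_factor / 1e6)
print(f"~{stored_tb:.0f} TB retained by day seven")   # ~272 TB

# With storage and compute coupled (OSS Kafka), brokers must be added just to
# hold this data, even though 150 MB/s of compute would otherwise suffice.
```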

[Figure: Growth of compute resources as storage scales (Apache Kafka vs. Confluent Cloud). With Apache Kafka, provisioned throughput (MB/s) grows with stored data (GB) even though required throughput stays flat; with Confluent Cloud, compute stays at the required level.]
10x more performant storage with Confluent Cloud

Performance is another place where we made storage 10x better. In Apache Kafka, mixing consumers that read both real-time (latest) and historical (earliest) data can put a lot of strain on the I/O system, slowing down throughput and increasing latency.

One of the great performance wins with Infinite Storage is the resource isolation between these two kinds of reads. Since "historical" consumers read from object storage, they rely on a network path that consumes a separate resource pool from real-time consumers. With this adjustment, the large batch reads you typically see with historical workloads do not compete with the streaming nature of real-time workloads, preventing latency spikes and improving throughput. Read our blog post to learn more about how we built Kafka storage that's 10x more scalable.
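From the client's point of view nothing changes: both kinds of consumers use the ordinary Kafka consumer API, and the isolation happens server-side. As a minimal sketch (the broker address, topic, and group IDs are placeholders), a historical "replay" consumer differs from a real-time one only in where it starts reading:

```python
# Sketch: a real-time consumer vs. a historical replay consumer on the same
# topic. Connection details are placeholders. On Confluent Cloud, the replay
# consumer's reads are served from object storage and isolated from the
# real-time path server-side; the client code is identical either way.
from confluent_kafka import Consumer

common = {
    "bootstrap.servers": "pkc-example.us-west-2.aws.confluent.cloud:9092",
    # ... SASL credentials omitted for brevity ...
}

realtime = Consumer({**common,
                     "group.id": "alerts-service",
                     "auto.offset.reset": "latest"})    # only new events

replay = Consumer({**common,
                   "group.id": "ml-backfill",
                   "auto.offset.reset": "earliest"})    # full history

for consumer in (realtime, replay):
    consumer.subscribe(["orders"])

msg = replay.poll(timeout=10.0)     # large sequential reads from old segments
if msg is not None and msg.error() is None:
    print(f"replayed offset {msg.offset()}: {msg.value()!r}")
```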
How 10x storage can reduce your total cost of ownership

- Pay only for the retained data: Confluent Cloud is pay-as-you-go, so you are only billed for actual storage used, not for any pre-provisioned or unused capacity. Also, with compute and storage separated, you no longer have to waste unnecessary compute infrastructure on storage-bound use cases.

- Reduce operational burdens: Infinite Storage by itself removes all the operational complexity of carefully planning and retaining the right amount of data in Kafka. What's more, Confluent Cloud can auto-scale storage based on retention policies and traffic, further freeing your operations resources for more value-adding activities.

- Avoid downtime, data loss, or a possible breach of data retention compliance: With Infinite Storage, you never have to worry about revenue loss or audit fines due to storage downtime or data loss caused by capacity limits or data expiration.

[Figure: Real-time consumers are served from broker-local storage via storage threads, while historical consumers are served from object storage via network threads. Resource isolation between real-time and historic data consumption leads to 10x storage performance.]

"The challenge with storage is the same that you face with all infrastructure that you provide. The storage grows, the storage has to be managed, the storage can fail, or you need to provision the storage; it's really difficult to understand the classes of storage that you need to put behind the scenes. So with Confluent services, we've been relieved from all that."

Olivier Richaud
Senior Director, Technical Platforms and Foundations, Symphony

Section 3: Resiliency

Leave Kafka reliability worries behind with 10x built-in availability and durability

Kafka is designed for high availability and durability through replication. However, this design is insufficient for a highly reliable data streaming service in the cloud—and it doesn't take away all the operational burdens and risks. This is especially true as complexities multiply when Kafka spans more use cases, apps, data systems, teams, and environments. Issues that can arise include:

- Limited downtime protections for Kafka software failures, zone failures, or cloud provider outages
- Lack of durability monitoring to help detect, prevent, and mitigate data integrity issues, either in real time or in batch
- High operational complexity for availability configuration, resiliency policy design, disaster recovery and failover deployment, manual upgrades and patching, etc.

A cloud product is only as useful as it is resilient. As businesses mature in their Apache Kafka adoption for mission-critical applications, the stakes become even higher. Any unplanned downtime or breach can result in lost revenue, reputation damage, fines or audits, reduced customer satisfaction scores, or critical data loss. The result? Valuable engineering resources have to be diverted to control the damage.

For a cloud service, resiliency means two things: availability (the platform is always accessible) and durability (the data stored within is protected from any corruption or loss).

[Figure: Downtime or failures can happen in multiple areas in a typical Kafka deployment: networking failures between clients and load balancers, Kubernetes failures around the brokers, and EBS failures in the (optional) storage layer.]

Let's take an OSS Kafka deployment in the cloud (one that follows all best practices) with three brokers running in one availability zone, a Kubernetes cluster, some servers, and load balancers to increase availability. While there is no SLA for OSS Kafka itself, we can estimate a de facto SLA from the SLAs of the underlying components.¹ Assuming three brokers on EC2, three load balancers, and three EBS volumes:

99.5%³ × 99.99%³ × 99.99%³ ≈ 98.45%

We'll be generous and round up to 99%. This translates to a potential 5,256 minutes (or 3.65 days) of downtime per year.

Confluent Cloud promises 10x higher availability, with a built-in 99.99% uptime SLA for our customers. This is one of the industry's highest and most comprehensive SLAs: it covers not only infrastructure but also Apache Kafka performance, critical bug fixes, security updates, and more—something no other hosted Kafka service can claim.

How did we achieve this? Here are a few areas to highlight:

- Built-in multi-zone availability in the product: While the Kafka service is configured to have three replicas for every Kafka topic, Confluent Cloud makes sure those three copies are distributed across three different availability zones. This ensures that two copies remain available even if an availability zone experiences failures.

- World-class cloud monitoring: We monitor each cluster with Health+ to proactively identify and address production issues before they impact workloads.

- Automation to handle cloud failures: With built-in self-healing automation, Confluent Cloud auto-detects the unavailability of cloud services (storage, compute, etc.) and mitigates the impact by appropriately isolating the affected node/replica. It also auto-rebalances the cluster on node add/remove or uneven workload scenarios.

The smallest increase in SLA means exponentially less downtime for your customers. Just two additional "9s" in Confluent Cloud's SLA translate to, at most, 52.6 minutes of downtime per year. Compared to our example OSS Kafka deployment, that's 100x less downtime, and it's why we say Confluent Cloud offers beyond 10x better availability than Apache Kafka. And the best part? This is all built into the product and doesn't require additional FTEs to manually maintain a strong SLA on your own.
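The downtime arithmetic is easy to verify; here is a minimal sketch, using the AWS component SLAs cited in the footnote below.

```python
# Sketch: composite availability of the example OSS deployment, and the
# downtime implied by a given SLA. Component SLAs are the AWS figures cited
# in the footnote: EC2 99.5%, EBS 99.99%, load balancers 99.99% (three each).
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600

composite = (0.995 ** 3) * (0.9999 ** 3) * (0.9999 ** 3)
print(f"de facto SLA: {composite:.2%}")   # ~98.45%

def downtime_min(sla: float) -> float:
    """Worst-case minutes of downtime per year permitted by an SLA."""
    return MINUTES_PER_YEAR * (1 - sla)

print(f"99%    -> {downtime_min(0.99):,.0f} min/yr")    # 5,256 (3.65 days)
print(f"99.99% -> {downtime_min(0.9999):,.1f} min/yr")  # 52.6
```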

We’ll be generous and round up to 99%. This translates to a potential 5,256


What’s more, to further minimize downtime and data loss in a regional outage

mins (or 3.65 days) of downtime per year.

for a public cloud provider, we introduced Cluster Linking and Schema Linking

to help organizations architect a multi-region disaster recovery plan for Kafka.


Confluent Cloud promises 10x higher availability with a built-in 99.99%

uptime availability SLA for our customers. This is one of the industry’s
1 SLA for AWS components ma: x 99.5% for individual EC2 instances,
highest and most comprehensive SLAs.
99.99% for EBS volumes, 99.99% for elastic load balancers

10x durability through automatic auditing services

Durability is the other side of resiliency, and it is a measure of data integrity. Apache Kafka primarily guarantees high durability through redundancy. We've further built robust tooling and durability auditing services to prevent, detect, and mitigate data integrity issues, relieving operators of the burden of monitoring data integrity themselves. At Confluent, we provide durability for an average of tens of trillions of messages per day.
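Confluent's audit pipeline itself is internal, but the core idea (continuously reconciling what brokers currently report against an independent record of what was durably acknowledged) can be illustrated with a small sketch. Everything here, from the topic name to the event format to the check itself, is hypothetical and is not Confluent's implementation.

```python
# Hypothetical sketch of a durability audit check, in the spirit of the
# pipeline in the figure below: durability events (what was appended, with
# checksums) flow to an audit service that compares them against what the
# broker currently reports. None of this is Confluent's actual code.
from dataclasses import dataclass

@dataclass
class DurabilityEvent:
    topic: str
    partition: int
    end_offset: int      # log end offset the broker reported at append time
    checksum: int        # rolling checksum of the appended batch

def audit(expected: DurabilityEvent, observed: DurabilityEvent) -> list[str]:
    """Return human-readable findings; an empty list means the check passed."""
    findings = []
    if observed.end_offset < expected.end_offset:
        findings.append(
            f"{expected.topic}/{expected.partition}: possible data loss, "
            f"offset regressed {expected.end_offset} -> {observed.end_offset}")
    elif observed.checksum != expected.checksum:
        findings.append(
            f"{expected.topic}/{expected.partition}: checksum mismatch, "
            "possible corruption")
    return findings      # a real system would raise alerts and open incidents

# Example: the broker now reports fewer records than were durably acknowledged.
expected = DurabilityEvent("orders", 0, end_offset=1_000_000, checksum=0xBEEF)
observed = DurabilityEvent("orders", 0, end_offset=999_850, checksum=0xBEEF)
for finding in audit(expected, observed):
    print(finding)
```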
[Figure: Durability events from each Kafka broker flow through a durability event topic into a Durability Audit Service backed by a durability database, with dashboards and alerts on top; both broker-local storage and object storage are audited. Confluent Cloud performs extensive durability auditing and monitoring with the added benefit of real-time detection and alerting.]

How 10x resiliency can reduce your total cost of ownership

- Avoid downtime, data loss, and business risks: Unplanned downtime and breaches can lead to substantial hidden costs, including lost revenue, reputation damage, fines/audits, or a reduced customer satisfaction score (CSAT). With a built-in 99.99% uptime SLA and robust durability auditing services, these risks are significantly reduced with Confluent Cloud.

- Offload Kafka maintenance: Building and maintaining a high level of system resiliency into self-managed Kafka takes a lot of effort and engineering time, and the resources required will increase by 50% year over year as customer demand rises. Confluent Cloud offers a fully managed service that deploys, optimizes, upgrades, patches, and recovers Kafka out of the box, so you can divert your valuable engineering resources away from keep-the-lights-on activities.

Read our blog post to learn more about how you can leave Kafka reliability concerns behind with Confluent Cloud.

Take Confluent Cloud for a test drive

With Kafka at its core, Confluent offers a truly cloud-native service that enables your business to set its data in motion while avoiding the headaches of low-level data infrastructure management. It's the only cloud Kafka service with enterprise-grade features, security, and zero ops burden for all of your data streaming needs—available everywhere you need it. An easier, safer, and more cost-effective way to stream real-time data is waiting, and you'll never look back.

"With Confluent Cloud, we no longer have to chase down the brokers, spin them back up, and see how to recover them. We see far fewer unbalanced partitions and don't have to think about all kinds of edge cases with the brokers that we used to spin up and handle in the past. So I have to say that the resiliency of our infrastructure has really improved with Confluent."

Natan Silnitsky
Senior Backend Engineer, Wix

GET STARTED: Get started with Confluent Cloud for free.
