Common Flink Mistakes

The Top 5 Mistakes Deploying Apache Flink
Webinar

Robert Metzger, Decodable
[email protected] @rmetzger_

Eric Sammer, Decodable
[email protected] @esammer

Today’s Webinar
- The Top 5 Mistakes Deploying Apache Flink
- Common Stream Processing Patterns using SQL
- Q&A
Common Flink Mistakes
Robert Metzger
Staff Engineer @ decodable, Committer and PMC Chair @ Flink
#1 Mistake: Serialization is expensive

- Mistake: people use Java Maps, Sets, etc. to store state or to do network transfers
- Serialization happens when:
  - transferring data over the network (between TaskManagers, or from/to sources/sinks)
  - accessing state in RocksDB (even in-memory)
  - sending data between non-chained tasks, even locally
- Serialization costs a lot of CPU cycles
#1 Mistake: Serialization is expensive

Example:

package co.decodable.talks.flink.performance;

private static class Location {
    int lon;
    int lat;
}

DataStream<HashMap<String, Location>> s1 = ...

For an input record (start lon:11 lat:22, end lon:88 lat:99), the serialized map looks like:

2 start co.decodable.talks.flink.performance.Location 11 22 end co.decodable.talks.flink.performance.Location 88 99

- map size: 4 bytes
- 1st entry key (“start”): 5 bytes
- 1st entry value type: 46 bytes
- 1st entry value fields: 8 bytes
- 2nd entry key (“end”): 3 bytes
- 2nd entry value type: 46 bytes
- 2nd entry value fields: 8 bytes
→ ~120 bytes per record
#1 Mistake: Serialization is expensive

Example:

public record OptimizedLocation(int startLon, int startLat, int endLon, int endLat) {}

DataStream<OptimizedLocation> s2 = ...

The serialized record is just the four int fields: 11 22 88 99 → 16 bytes

→ 7.5x reduction in data
Fewer object allocations = fewer CPU cycles

Further reading: “Flink Serialization Tuning Vol. 1: Choosing your Serializer — if you can”,
https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html

Disclaimer: the actual binary representation used by Kryo might differ; this is for demonstration purposes only.
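Flink’s own serializers are not runnable outside a Flink job, but the overhead of generic containers can be sketched with plain JDK serialization. This is a hypothetical demo (an assumption: JDK serialization, like a generic fallback serializer, writes class metadata per object; its exact format differs from Kryo’s), comparing a HashMap-based record against a flat fixed-width encoding of the same four ints:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.HashMap;

public class SerializationDemo {
    static class Location implements Serializable {
        final int lon, lat;
        Location(int lon, int lat) { this.lon = lon; this.lat = lat; }
    }

    // Size of the default JDK serialized form (includes class names, headers, etc.).
    static int javaSerializedSize(Object o) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, Location> map = new HashMap<>();
        map.put("start", new Location(11, 22));
        map.put("end", new Location(88, 99));

        // Flat encoding: just the four fields, 16 bytes, no type metadata.
        byte[] flat = ByteBuffer.allocate(16)
                .putInt(11).putInt(22).putInt(88).putInt(99).array();

        System.out.println("HashMap form: " + javaSerializedSize(map) + " bytes");
        System.out.println("flat form:    " + flat.length + " bytes");
    }
}
```

The exact HashMap byte count depends on the JDK, but it is always far larger than the 16-byte flat encoding, which is the point of the slide.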
#2 Mistake: Flink doesn’t always need to be distributed

- Flink’s MiniCluster allows you to spin up a full-fledged Flink cluster with everything known from distributed clusters (RocksDB, checkpointing, the web UI, SQL, …)

var clusterConfig = new MiniClusterConfiguration.Builder()
        .setNumTaskManagers(1)
        .setNumSlotsPerTaskManager(1)
        .build();
var cluster = new MiniCluster(clusterConfig);
cluster.start();
var clusterAddress = cluster.getRestAddress().get();

var env = new RemoteStreamEnvironment(clusterAddress.getHost(),
        clusterAddress.getPort());
#2 Mistake: Flink doesn’t always need to be distributed

- Use cases:
  - Local debugging and performance profiling: step through the code as it executes, sample the most frequently used code paths
  - Testing: make sure your Flink jobs work in end-to-end tests (together with Kafka’s MiniCluster, or MinIO as an S3 replacement). Check out https://www.testcontainers.org/
  - Processing small streams efficiently
#3 Advice: Deploy one job per cluster, use standalone mode

… unless you have a good reason to do something else.

- Flink’s deployment options might seem confusing. Here’s a simple framework to think about them:
- Flink has 3 execution modes:
  - Session mode
  - Per-job mode
  - Application mode (preferred)
- Flink has 2 deployment models:
  - Integrated (active): native K8s, YARN, (Mesos)
    - Flink requests resources from the resource manager as needed
  - Standalone (passive): well suited for K8s, bare metal, local deployment, DIY
    - Resources are provided to Flink from the outside world
#3 Execution Modes

- Session Mode: multiple jobs share a JobManager
- Application Mode (recommended as default): one job per JobManager, planned on the JobManager
- Per-Job Mode: one job per JobManager, planned outside the JobManager
#3 Deployment Options

Passive Deployment (“Standalone mode”):
- Flink resources managed externally → “a bunch of JVMs”
- Deployed on bare metal, Docker, Kubernetes
- Pros / cons:
  - + Reactive Mode (“autoscaling”)
  - + DIY scenarios
  - + Fast deployments
  - − Restarting failed resources must be handled from the outside

Active Deployment:
- Flink actively manages resources → Flink talks to a resource manager
- Implementations: native Kubernetes, YARN
- Pros / cons:
  - + Automatically restarts failed resources
  - + Allocates only required resources
  - − Requires a lot of K8s permissions
#4 Mistake: Inappropriate cluster sizing

- Mistake: under- or over-provisioning of clusters for a given workload
- Understand the amount of data you have incoming and outgoing
  - How much network bandwidth do you have? How much throughput does your Kafka have?
- Understand the amount of state you’ll need in Flink
  - Which state backend do you use?
  - How much memory / disk space do you have available (per instance, in your cluster)?
  - How fast is your connection to your state backup (e.g. S3)? This gives you a baseline for checkpointing times
Solution: Proper cluster sizing

- Do a back-of-the-napkin calculation of your use case in your environment
- … assuming normal operation (“baseline”). Include a buffer for spiky loads (failure recovery, …)
Example: Proper cluster sizing

● Data:
  ○ Message size: 2 KB
  ○ Throughput: 1,000,000 msg/sec
  ○ Distinct keys: 500,000,000 (aggregation in window: 4 longs per key)
  ○ Checkpoint every minute
● Hardware:
  ○ 5 machines, each running a TaskManager

Pipeline: Kafka Source → keyBy userId → Sliding Window (5m size, 1m slide) → Kafka Sink, with RocksDB as the state backend
Example: A machine’s perspective

TaskManager n:
- Kafka Source in: 2 KB * 1,000,000 msg/sec = 2 GB/s cluster-wide; 2 GB/s / 5 machines = 400 MB/s per machine
- keyBy shuffle: 400 MB/s / 5 receivers = 80 MB/s per receiver; 1 receiver is local, 4 are remote: 4 * 80 MB/s = 320 MB/s shuffle out (and likewise 320 MB/s shuffle in from the other machines)
- window → Kafka Sink out: 67 MB/s
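The per-machine arrows above can be re-derived with a few lines of arithmetic. A minimal sketch (using decimal KB/MB, as the slide’s round numbers do):

```java
// Recomputes the network figures for the example workload:
// 1,000,000 msg/sec of 2 KB messages spread across 5 TaskManagers.
public class NetworkSizing {
    static final long MSG_PER_SEC = 1_000_000L;
    static final long MSG_BYTES = 2_000L;   // 2 KB, decimal units as on the slide
    static final int MACHINES = 5;

    // Kafka source input per machine: 2 GB/s cluster-wide over 5 machines.
    static long perMachineInBytesPerSec() {
        return MSG_PER_SEC * MSG_BYTES / MACHINES;
    }

    // keyBy shuffle output per machine: 400 MB/s fans out evenly to
    // 5 receivers (80 MB/s each); 1 is local, so 4 * 80 MB/s leaves the machine.
    static long shuffleOutBytesPerSec() {
        long perReceiver = perMachineInBytesPerSec() / MACHINES;
        return (MACHINES - 1) * perReceiver;
    }

    public static void main(String[] args) {
        // prints 400 MB/s and 320 MB/s, matching the diagram
        System.out.println("Kafka in per machine   : " + perMachineInBytesPerSec() / 1_000_000 + " MB/s");
        System.out.println("shuffle out per machine: " + shuffleOutBytesPerSec() / 1_000_000 + " MB/s");
    }
}
```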
Excursion: State & Checkpointing

How much state are we checkpointing?

per machine: 40 bytes * 5 windows * 100,000,000 keys (= 500,000,000 distinct keys / 5 machines) = 20 GB

We checkpoint every minute, so: 20 GB / 60 seconds = 333 MB/s

How is the window operator accessing state on disk?

For each key-value access, we need to retrieve 40 bytes from disk, update the aggregates, and put 40 bytes back.

per machine: 40 bytes * 5 windows * 200,000 msg/sec (= 1,000,000 msg/sec / 5 machines) = 40 MB/s
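The state and checkpoint numbers above, written out as a checkable sketch (the 40 bytes per key and the 5 overlapping windows are taken from the slide):

```java
// State and checkpoint bandwidth for the example: 500M keys, 5 machines,
// a 5m window sliding by 1m (5 overlapping windows), ~40 bytes per key.
public class StateSizing {
    static final long DISTINCT_KEYS = 500_000_000L;
    static final int MACHINES = 5;
    static final int WINDOWS = 5;           // 5m size / 1m slide
    static final long BYTES_PER_KEY = 40L;  // 4 longs of aggregates + overhead
    static final long MSG_PER_SEC = 1_000_000L;

    // Checkpointed state per machine: 40 B * 5 windows * 100M keys = 20 GB.
    static long statePerMachineBytes() {
        return BYTES_PER_KEY * WINDOWS * (DISTINCT_KEYS / MACHINES);
    }

    // Checkpointing every minute means moving that state in ~60 s: ~333 MB/s.
    static long checkpointBytesPerSec() {
        return statePerMachineBytes() / 60;
    }

    // RocksDB access: each message touches 5 windows at 40 B per access,
    // at 200,000 msg/sec per machine = 40 MB/s (each way, read and write).
    static long rocksDbBytesPerSec() {
        return BYTES_PER_KEY * WINDOWS * (MSG_PER_SEC / MACHINES);
    }

    public static void main(String[] args) {
        System.out.println("state per machine: " + statePerMachineBytes() / 1_000_000_000L + " GB");
        System.out.println("checkpoint rate  : " + checkpointBytesPerSec() / 1_000_000L + " MB/s");
        System.out.println("RocksDB access   : " + rocksDbBytesPerSec() / 1_000_000L + " MB/s");
    }
}
```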


Example: A machine’s perspective

TaskManager n, including checkpointing:
- In: Kafka Source 400 MB/s + shuffle 320 MB/s
- Out: shuffle 320 MB/s + Kafka Sink 67 MB/s + checkpoints 333 MB/s

Total In: 720 MB/s — Total Out: 720 MB/s
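As a sanity check, the per-machine budget balances: everything received equals everything sent (a property of this particular workload, not a general rule):

```java
// Sums the per-machine bandwidth figures (in MB/s) from the example slide.
public class BandwidthBudget {
    static int totalInMBps()  { return 400 + 320; }        // Kafka source + shuffle in
    static int totalOutMBps() { return 320 + 67 + 333; }   // shuffle out + Kafka sink + checkpoints

    public static void main(String[] args) {
        // both sides come to 720 MB/s
        System.out.println("in=" + totalInMBps() + " MB/s, out=" + totalOutMBps() + " MB/s");
    }
}
```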
Cluster sizing: Conclusion

- This was just a back-of-the-napkin approximation! Real-world results will differ!
- Network factors we ignored:
  - Protocol overheads (Ethernet, IP, TCP, …)
  - RPC (Flink’s own RPC, Kafka, checkpoint store)
  - Checkpointing causes network bursts
  - A window emission causes bursts
  - Other systems using the network
- CPU, memory, and disk access speed have not been considered
#5 Advice: Ask for help!

- Most problems have been solved already online
- Official, old-school way: the [email protected] mailing list
  - Indexed by Google, searchable through https://lists.apache.org/
- Stack Overflow: the apache-flink tag has 6,300 questions!
- The Apache Flink Slack instance
- Global meetup communities, Flink Forward (with training)
Any Flink deployment & ops related questions?
Get Started with Decodable
● Visit http://decodable.co
● Start free: http://app.decodable.co
● Read the docs: http://docs.decodable.co
● Watch demos on our YouTube channel
● Join our community Slack channel
● Join us for future Demo Days and webinars!
Thank you.
Build real-time data apps &
services. Fast.

decodable.co 2022
