
Macroarea di Ingegneria

Dipartimento di Ingegneria Civile e Ingegneria Informatica

Apache Flink: Hands-on Session


Academic Year 2021/22
Matteo Nardelli

Master's Degree in Computer Engineering - 2nd year


The reference Big Data stack

[Stack diagram: High-level Interfaces at the top; Data Processing, Data Storage, and Resource Management layered below; Support / Integration as a cross-cutting column]
Apache Flink
• Apache Flink is a framework and distributed processing
engine for stateful computations over unbounded and
bounded data streams.
• Unbounded streams: have a start but no defined end;
they must be continuously processed, since it is not
possible to wait for all data to arrive
• Stream processing
• Bounded streams: have a defined start and end; they
can be processed by ingesting all data before running
any computation; ordered ingestion is usually not
required (the data can be sorted)
• Batch processing

• Flink has been designed to run in all common cluster
environments, and to perform computations at in-memory
speed and at any scale.
Apache Flink
• Flink is designed to run stateful streaming
applications at any scale.
• Applications are parallelized into possibly thousands of
tasks that are distributed and concurrently executed in a
cluster.
• Leverage In-Memory Performance
• Stateful Flink applications are optimized for local state
access.

Apache Flink
• Key concepts:
• Stream:
• bounded/unbounded;
• real-time/recorded
• State:
• Flink offers state primitives,
• pluggable state backends (e.g., RocksDB),
• exactly-once semantics,
• scalable applications (data partitioning and
distribution)
• Time:
• event-time vs processing-time mode;
• watermark;
• late data handling
Apache Flink: APIs
• Multiple APIs at different levels of abstraction

Apache Flink: ProcessFunction API
ProcessFunction API:
• Low-level stream processing operation
• Handles events by being invoked for each event
received
• Has access to (RuntimeContext):
• Events (stream elements)
• State (fault-tolerant, consistent, only on keyed
stream)
• Timers (event time and processing time, only on
keyed stream)

Read more: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/process_function/
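A minimal sketch of a KeyedProcessFunction (not from the original deck; names and the 60 sec interval are illustrative, imports omitted as elsewhere in the deck): it keeps a per-key count in fault-tolerant keyed state and registers an event-time timer for each element, emitting the current count when a timer fires.

public class CountWithTimer
        extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;  // fault-tolerant keyed state

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> value, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);
        // register an event-time timer 60 sec after this element's timestamp
        ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        // emit the count accumulated so far for this key
        out.collect(Tuple2.of(ctx.getCurrentKey(), count.value()));
    }
}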

Apache Flink: DataStream API
• Data streaming applications: DataStream API
– Supports functional transformations on data
streams, with user-defined state and flexible
windows
– Example: WindowWordCount, a windowed version of
WordCount using Flink's DataStream API, with a sliding
time window of 10 sec length and 5 sec slide
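A hedged reconstruction of the snippet pictured on the slide (the exact listing is at the link below; here words is an already tokenized DataStream<String>):

DataStream<Tuple2<String, Integer>> windowCounts = words
    .map(w -> Tuple2.of(w, 1))
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(t -> t.f0)
    // sliding time window of 10 sec length and 5 sec slide
    .window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .sum(1);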

See https://bit.ly/2AhCEBX
Apache Flink: DataStream API
DataStream API:
• Provides primitives for many common stream
processing operations:
• Windowing
• Record-at-a-time transformations
• Enriching events
• Based on functions, e.g., map(), reduce(), and
aggregate()

DataStream<Tuple2<String, Long>> result = words
    .map(word -> Tuple2.of(word, 1L))
    .returns(Types.TUPLE(Types.STRING, Types.LONG))
    .keyBy(0)
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
Apache Flink: Table API and SQL
Table API
• Table API and SQL are unified APIs for batch and
stream processing;
• They can be seamlessly integrated with the
DataStream and DataSet APIs;
• They support user-defined scalar, aggregate, and
table-valued functions.
• Relational APIs are designed to ease the definition of
data analytics, data pipelining, and ETL applications
Sessionize a clickstream and count the number of clicks per session

SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId
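A minimal sketch of how this query could be run from Java (not on the original slide; it assumes a clicks table with userId and clicktime columns has already been registered):

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
Table sessionClicks = tableEnv.sqlQuery(
    "SELECT userId, COUNT(*) AS cnt " +
    "FROM clicks " +
    "GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId");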
Flink: APIs and libraries
• Batch processing applications: DataSet API
– Supports a wide range of data types beyond key/value
pairs and a wealth of operators
Core of PageRank algorithm using DataSet API
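A hedged sketch of that core (the full listing is at the link below; assumed inputs: ranks as (pageId, rank) tuples, links as (pageId, neighborIds) adjacency lists, DAMPING and NUM_PAGES constants):

IterativeDataSet<Tuple2<Long, Double>> iteration = ranks.iterate(maxIterations);

DataSet<Tuple2<Long, Double>> newRanks = iteration
    // join each page's current rank with its adjacency list
    .join(links).where(0).equalTo(0)
    // distribute the rank of a page evenly over its outgoing links
    .flatMap((Tuple2<Tuple2<Long, Double>, Tuple2<Long, Long[]>> joined,
              Collector<Tuple2<Long, Double>> out) -> {
        double share = joined.f0.f1 / joined.f1.f1.length;
        for (Long neighbor : joined.f1.f1) {
            out.collect(Tuple2.of(neighbor, share));
        }
    })
    .returns(Types.TUPLE(Types.LONG, Types.DOUBLE))
    // sum the partial ranks flowing into each page
    .groupBy(0).sum(1)
    // apply the damping factor
    .map(r -> Tuple2.of(r.f0, (1 - DAMPING) / NUM_PAGES + DAMPING * r.f1))
    .returns(Types.TUPLE(Types.LONG, Types.DOUBLE));

DataSet<Tuple2<Long, Double>> finalRanks = iteration.closeWith(newRanks);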

See https://bit.ly/2zEH3Pk
Anatomy of a Flink program
• Let’s analyze DataStream API
https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html

• A special DataStream class is used to represent a
collection of data in a Flink program
• Each Flink program consists of the same basic parts:
1. Obtain an execution environment

2. Load/create initial data

Anatomy of a Flink program
3. Specify transformations on data by calling methods on
DataStream

4. Specify where to put the results of your computations

5. Trigger the program execution by calling execute() on
the StreamExecutionEnvironment
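A minimal sketch mapping code to the five parts above (not on the original slides; host and port are illustrative, statements go inside a main method):

// 1. obtain the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 2. load/create the initial data
DataStream<String> text = env.socketTextStream("localhost", 9999);
// 3. specify transformations on the data
DataStream<Integer> lengths = text.map(String::length);
// 4. specify where to put the results
lengths.print();
// 5. trigger the program execution
env.execute("Anatomy example");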

Flink: Lazy evaluation

• Flink programs are executed lazily
– When the program's main method is executed, data
loading and transformations do not happen directly
– Rather, each operation is created and added to the
program's plan
– Operations are actually executed when execution
is explicitly triggered by calling execute() on the
execution environment

Flink: data sources
• Several predefined stream sources accessible from the
StreamExecutionEnvironment
1. File-based:
– E.g., readTextFile(path) to read text files
– Flink splits the file reading process into two sub-tasks: directory monitoring
and data reading
• Monitoring is implemented by a single, non-parallel task, while reading is
performed by multiple tasks running in parallel, whose parallelism is equal to
the job parallelism
2. Socket-based
3. Collection-based
4. Custom
– E.g., to read from Kafka fromSource(new KafkaSource<…>(…)); see the
sketch below
https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/
– See Apache Bahir for streaming connectors and SQL data sources:
https://bahir.apache.org/
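A minimal sketch of the KafkaSource builder from the linked docs (broker, topic, and group id are placeholders):

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setTopics("input-topic")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

DataStream<String> stream =
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");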

Flink: DataStream transformations
• Map
DataStream → DataStream
– Example: double the values of the input stream

• FlatMap
DataStream → DataStream
– Example: split sentences to words
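Hedged reconstructions of the two pictured examples (intStream and sentences are assumed input streams):

// Map: double the values of the input stream
DataStream<Integer> doubled = intStream.map(value -> value * 2);

// FlatMap: split sentences to words
DataStream<String> words = sentences
    .flatMap((String sentence, Collector<String> out) -> {
        for (String word : sentence.split(" ")) {
            out.collect(word);
        }
    })
    .returns(Types.STRING);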

Flink: DataStream transformations
• Filter
DataStream → DataStream
– Example: filter out zero values

• KeyBy
DataStream → KeyedStream
– To specify a key that logically partitions a stream into disjoint partitions
– Internally, implemented with hash partitioning
– Different ways to specify keys, the simplest case is grouping tuples on one
or more fields of the tuple
– Examples:
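Hedged reconstructions of the pictured examples (intStream and tupleStream are assumed input streams):

// Filter: keep only non-zero values
DataStream<Integer> nonZero = intStream.filter(value -> value != 0);

// KeyBy: logically partition tuples by their first field
KeyedStream<Tuple2<String, Long>, String> keyed = tupleStream.keyBy(t -> t.f0);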

Flink: DataStream transformations

• Reduce
KeyedStream → DataStream
– “Rolling” reduce on a keyed data stream
– Combines the current element with the last reduced value and emits
the new value
– Example: create a stream of partial sums
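A hedged reconstruction of the pictured example, on the keyed stream of (word, count) tuples from the previous slide:

// emits an updated partial sum for every incoming element
DataStream<Tuple2<String, Long>> partialSums = keyed
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));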

Flink: DataStream transformations
• Aggregations
KeyedStream → DataStream
– To aggregate on a keyed data stream
– min returns the minimum value, whereas minBy returns the element that
has the minimum value in this field

• Window
KeyedStream → WindowedStream
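A hedged sketch of both transformations on the keyed stream of (key, value) tuples used above:

DataStream<Tuple2<String, Long>> mins = keyed.min(1);       // running minimum of field 1
DataStream<Tuple2<String, Long>> minElems = keyed.minBy(1); // element holding the minimum

// group the keyed stream into 5 sec tumbling event-time windows
WindowedStream<Tuple2<String, Long>, String, TimeWindow> windowed =
    keyed.window(TumblingEventTimeWindows.of(Time.seconds(5)));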

Flink: DataStream transformations
• Other transformations available in Flink
– join: joins two data streams on a given key
– union: union of two or more data streams creating a new
stream containing all the elements from all the streams
– split: splits the stream into two or more streams
according to some criterion
– iterate: creates a “feedback” loop in the flow, by
redirecting the output of one operator to some previous
operator
• Useful for algorithms that continuously update a model

See https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/

Example: streaming window WordCount
• Count the words from a web socket in 5 sec windows

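A hedged reconstruction of the listing pictured on this and the following slide (host and port are illustrative):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.split(" ")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            // Key by the first element of a Tuple
            .keyBy(t -> t.f0)
            .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            .sum(1)
            .print();

        env.execute("WindowWordCount");
    }
}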

Flink: windows support
• Windows can be applied either to keyed streams or to
non-keyed ones
• General structure of a windowed Flink program
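A minimal sketch of that structure for a keyed stream of (word, count) tuples (not the slide's figure; non-keyed streams use windowAll() instead of keyBy()+window()):

stream
    .keyBy(t -> t.f0)                                      // keyed stream
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))  // window assigner
    //.trigger(...)          optional: when the window is evaluated
    //.evictor(...)          optional: remove elements before/after evaluation
    //.allowedLateness(...)  optional: how long to keep accepting late elements
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));       // window function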

Flink: window lifecycle
• First, specify whether the stream is keyed or not and define
the window assigner
– A keyed stream allows the windowed computation to be
performed in parallel by multiple tasks
– The window is completely removed when the time (event or
processing time) passes its end timestamp plus the user-specified
allowed lateness

• Then, associate with the window its trigger, (optional) evictor,
and function
– Trigger determines when a window is ready to be processed by the
window function
– Evictor (optional) has the ability to remove elements from a window
after the trigger fires and before and/or after the window function is
applied
– Function specifies the computation to be applied to the window
contents
Read more: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/windows/
Flink: window assigners
• How elements are assigned to windows
• Support for different window assigners
– Each WindowAssigner comes with a default Trigger
• Built-in assigners for most common use cases:
– Tumbling windows
– Sliding windows
– Session windows
– Global windows
• Except for global windows, they assign elements to
windows based on time, which can either be processing
time or event time
• It is also possible to implement a custom window assigner
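Hedged examples of the built-in assigners on a keyed stream (event-time variants shown; processing-time counterparts exist as well):

keyed.window(TumblingEventTimeWindows.of(Time.seconds(10)));                  // tumbling
keyed.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)));  // sliding
keyed.window(EventTimeSessionWindows.withGap(Time.minutes(30)));              // session
keyed.window(GlobalWindows.create());  // global: requires a custom trigger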

Flink: window assigners
• Session windows
– To group elements by sessions of
activity
– Differently from tumbling and sliding
windows, they do not overlap and do
not have a fixed start and end time
– A session window closes when a
gap of inactivity occurs
• Global windows
– To assign all elements with the
same key to the same single global
window
– Only useful if you also specify a
custom trigger

Flink: window functions

• Different window functions to specify the computation
on each window

• ReduceFunction
– To incrementally aggregate the elements of a window
– Example: sum up the second fields of the tuples for all elements in a
window
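A hedged reconstruction of the pictured example:

DataStream<Tuple2<String, Long>> sums = keyed
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    // incrementally sum the second field of the tuples in each window
    .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));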

Flink: window functions
• AggregateFunction: generalized version of a ReduceFunction
– Example: compute average of the elements in the window
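A hedged reconstruction modeled on the AverageAggregate example in the Flink docs: the accumulator is a running (sum, count) pair over the second field of (String, Long) tuples.

public class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return Tuple2.of(0L, 0L);  // (sum, count)
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> acc) {
        return Tuple2.of(acc.f0 + value.f1, acc.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return ((double) acc.f0) / acc.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}
// usage: keyed.window(...).aggregate(new AverageAggregate())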

Flink: window functions
• AggregateFunction
– Example: compute weighted average of the elements in the window

Flink: window functions
• ProcessWindowFunction: gets an Iterable containing all
the elements of the window, and a Context object with access to
time and state information
✓ More flexibility than other window functions
✗ At the cost of performance and resource consumption: elements are
buffered until the window is ready for processing
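A hedged sketch modeled on the docs' example: count the elements of each window by iterating over the buffered Iterable, using the window metadata from the Context.

public class CountProcessWindowFunction
        extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {

    @Override
    public void process(String key, Context context,
                        Iterable<Tuple2<String, Long>> elements,
                        Collector<String> out) {
        long count = 0;
        for (Tuple2<String, Long> ignored : elements) {
            count++;
        }
        out.collect("window " + context.window() + " for key " + key + ": " + count);
    }
}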

• ReduceFunction and AggregateFunction can execute
more efficiently
– Flink can incrementally aggregate the elements for each
window as they arrive

Flink: control events

• Control events: special events injected in the
data stream by operators

• Two types of control events in Flink
– Watermarks
– Checkpoint barriers

Flink: watermarks
• Watermarks mark the progress of event time within a
data stream
• They flow as part of the data stream and carry a timestamp t
– W(t) declares that event time
has reached time t in that
stream, meaning that there
should be no more elements
with timestamp t’ <= t
– Crucial for out-of-order
streams, where events are not
ordered by their timestamps

Read more: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/event-time/generating_watermarks/
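A minimal sketch of watermark generation with the WatermarkStrategy API from the linked docs (MyEvent and its timestamp field are illustrative; out-of-orderness bound of 5 sec):

DataStream<MyEvent> withTimestamps = events.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, recordTs) -> event.timestamp));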

Flink: watermarks
• By default, late elements are dropped when the
watermark is past the end of the window
• However, Flink allows specifying a maximum allowed
lateness for a window operator
– By how much time elements can be late before they are
dropped (0 by default)
– Late elements that arrive after the watermark has passed the
end of the window but before it passes the end of the window
plus the allowed lateness, are still added to the window
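A hedged sketch (window sizes and the side-output tag are illustrative): accept elements up to 10 sec late and redirect anything later to a side output instead of dropping it.

OutputTag<Tuple2<String, Long>> lateTag =
    new OutputTag<Tuple2<String, Long>>("late-data") {};

SingleOutputStreamOperator<Tuple2<String, Long>> result = keyed
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .allowedLateness(Time.seconds(10))  // keep the window 10 sec longer
    .sideOutputLateData(lateTag)        // instead of silently dropping
    .sum(1);

DataStream<Tuple2<String, Long>> lateStream = result.getSideOutput(lateTag);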

Flink: watermarks
• Flink does not provide ordering guarantees after any
form of stream partitioning or broadcasting
– In such cases, dealing with out-of-order tuples is left to the
operator implementation

Flink: application execution
• Data parallelism
– Different operators of the same program may have different
levels of parallelism
– The parallelism of an individual operator, data source, or data
sink can be defined by calling its setParallelism() method
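A minimal sketch (parallelism values are illustrative):

env.setParallelism(4);  // default parallelism for all operators of the job

DataStream<Tuple2<String, Integer>> counts = words
    .map(w -> Tuple2.of(w, 1))
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .setParallelism(8);  // operator-level override

counts.print().setParallelism(1);  // e.g., force a non-parallel sink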

Flink: application execution

• The execution plan can be visualized
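For instance, the JSON plan can be dumped before calling execute() and pasted into the Flink plan visualizer (https://flink.apache.org/visualizer/):

System.out.println(env.getExecutionPlan());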

Flink: application monitoring
• Built-in monitoring and metrics system
• Allows gathering and exposing metrics to external systems
• Built-in metrics include
– Throughput: in terms of number of records per sec (per
operator/task)
– Latency
• Support for latency tracking: special markers (called LatencyMarker)
are periodically inserted at all sources in order to obtain a distribution
of latency between sources and each downstream operator
– But they do not account for time spent in operator processing
(or in window buffers)
– They assume that all machines' clocks are synchronized
– Used JVM heap/non-heap/direct memory
– Availability, checkpointing

38
V. Cardellini - SABD 2020/21
Flink: application monitoring
• Application-specific metrics can be added
– E.g., counters for number of invalid records (see the sketch below)
• All metrics can be
– queried via Flink’s Monitoring REST API
– visualized in Flink’s Dashboard (Metrics tab)
– or sent to external systems (e.g., Graphite and InfluxDB)
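A hedged sketch of such a counter (names are illustrative), registered through a rich function's metric group:

public class RecordValidator extends RichMapFunction<String, String> {

    private transient Counter invalidRecords;

    @Override
    public void open(Configuration parameters) {
        invalidRecords = getRuntimeContext()
            .getMetricGroup()
            .counter("invalidRecords");
    }

    @Override
    public String map(String value) {
        if (value.isEmpty()) {
            invalidRecords.inc();  // exposed via the REST API / dashboard
        }
        return value;
    }
}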

See https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html
